It's a common task in NLP to either check a text against a pattern or extract the parts of the text that match a certain pattern. A regular expression or "regex" is a powerful tool to achieve this.
While powerful, regex can feel daunting as it comes with a lot of features and sub-parts that you need to remember.
In this post, I will illustrate the various concepts underlying regex. The goal is to help you build a good mental model of how a regex pattern works.
Mental Model
Let's start with a simple example where we are trying to find the word 'cool' in the text.
With regex, we could simply type out the word 'cool' as the pattern and it will match the word.
'cool'
While regex matched our desired word 'cool', the way it operates is not at the word level but the character level. This is the key idea.
Key Idea: Regex works at the character-level, not word-level.
The implication of this is that the regex r'cool' would also match sentences where 'cool' appears only as part of a longer word, such as 'cooler' or 'uncool'.
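As a quick check of the key idea, here is a minimal sketch using Python's re module. The sample sentences are made up for illustration.

```python
import re

# Hypothetical sample sentences: 'cool' appears as a word and inside longer words.
sentences = ["This is cool", "This is cooler than I thought", "That was uncool"]

# The pattern is matched character by character, so word boundaries are ignored.
matches = {s: re.findall(r'cool', s) for s in sentences}

for sentence, found in matches.items():
    print(sentence, "->", found)
```

Every sentence yields a match, including the ones where 'cool' is only part of another word.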
Basic Building Blocks
Now that we understand the key idea, let's understand how we can match simple characters using regex.
a. Specific character
We can simply specify the character in the regular expression and it will match all instances in the text.
For example, the regular expression below will match all instances of 'a' in the text. You can use any lowercase or uppercase letter.
'a'
You can also use any digits from 0 to 9 and it will work as well.
'3'
Note that regex is case-sensitive by default and thus the following regex won't match anything.
'A'
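The behaviour above can be sketched in Python as follows. The sample text is an assumption chosen so that it contains no capital 'A'; the re.IGNORECASE flag is shown only as one possible way to relax case sensitivity.

```python
import re

# Hypothetical sample text with lowercase letters and one digit.
text = "a banana and 3 apples"

lower_a = re.findall(r'a', text)   # every lowercase 'a'
three = re.findall(r'3', text)     # the digit 3
upper_a = re.findall(r'A', text)   # case-sensitive: no capital 'A' in the text

# Optional: ignore case so that 'A' also matches 'a'.
any_case_a = re.findall(r'A', text, flags=re.IGNORECASE)

print(len(lower_a), three, upper_a, len(any_case_a))
```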
b. White space character
We can detect special characters such as whitespace and newlines using special escape sequences.
Besides the common ones above, we have:
- \r for carriage return
- \f for form feed
- \e for escape (supported by engines such as PCRE, but not by Python's re module).
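As a small sketch, assuming the common escape sequences are \t for a tab and \n for a newline, this is how they can be detected in Python. The sample text is invented for illustration.

```python
import re

# Hypothetical text containing two real tab characters and one real newline.
text = "name\tage\nAda\t36"

tabs = re.findall(r'\t', text)       # matches each tab character
newlines = re.findall(r'\n', text)   # matches each newline character

print(len(tabs), len(newlines))
```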
c. Special sequences
Regex provides a number of built-in special sequences that each match a whole group of characters. These begin with a backslash (\).
Pattern: \d
It matches any single digit from 0 to 9.
Notice that each match is a single digit. So for the text 18.04 we get 4 separate matches (1, 8, 0 and 4) instead of the single number 18.04.
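A minimal sketch of this in Python; the surrounding sentence is an assumption, only the number 18.04 comes from the text above.

```python
import re

# \d matches one digit at a time, so 18.04 produces four separate matches.
digits = re.findall(r'\d', 'Ubuntu 18.04 was released')
print(digits)  # ['1', '8', '0', '4']
```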
Pattern: \s
It matches any whitespace character (space, tab or newline).
Pattern: \w
It matches any lowercase letter (a to z), uppercase letter (A to Z), digit (0 to 9), or underscore.
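Here is a short sketch of \s and \w in Python, using made-up sample strings.

```python
import re

# \s picks out the whitespace characters: here a space and a tab.
whitespace = re.findall(r'\s', 'snake_case v2\tok')
print(whitespace)  # [' ', '\t']

# \w picks out letters, digits and the underscore, skipping the space and '!'.
word_chars = re.findall(r'\w', 'a_1 !')
print(word_chars)  # ['a', '_', '1']
```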
Pattern: .
It matches any character except a newline (\n).
>>> import re
>>> re.findall(r'.', 'line 1\nline2')
['l', 'i', 'n', 'e', ' ', '1', 'l', 'i', 'n', 'e', '2']
Pattern: Negations
If you use the capitalized versions of the patterns above (\D, \S and \W), they act as negations.
For example, since "\d" matches any digit from 0 to 9, "\D" matches anything except the digits 0 to 9.
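The negated sequences can be tried out as below. This is only a sketch on an invented sample string.

```python
import re

text = "Room 42!"  # hypothetical sample text

non_digits = re.findall(r'\D', text)   # everything except 0-9
non_word = re.findall(r'\W', text)     # everything except letters, digits, underscore
non_space = re.findall(r'\S', text)    # everything except whitespace

print(non_digits)  # ['R', 'o', 'o', 'm', ' ', '!']
print(non_word)    # [' ', '!']
print(non_space)   # ['R', 'o', 'o', 'm', '4', '2', '!']
```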
d. Character sets
These are patterns that start with [ and end with ], and the brackets enclose the characters that should be matched.
For example, the following pattern matches any of the characters 'a', 'e', 'i', 'o', and 'u'.
[aeiou]
You can also replicate the functionality of \d using the pattern below. It will match any digit from 0 to 9.
[0123456789]
Instead of specifying all the digits, we can use - to specify only the start and end digits. So, instead of [0123456789], we can do:
[0-9]
For example, [2-4] can be used to match any digit from 2 to 4, i.e. 2, 3 or 4.
You can even use the special sequences we learned previously inside the brackets. For example, you can match any digit from 0 to 9 or any whitespace as:
[\d\s]
Below, I have listed some useful common patterns and what they mean.
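As a rough illustration, the character sets discussed in this section can be run in Python like this. The sample text is an assumption, and these patterns are illustrative picks rather than a definitive list.

```python
import re

text = "Order 66 at 3 pm"  # hypothetical sample text

vowels = re.findall(r'[aeiou]', text)           # lowercase vowels only
all_digits = re.findall(r'[0-9]', text)         # same effect as \d here
two_to_four = re.findall(r'[2-4]', text)        # only the digits 2, 3 and 4
digit_or_space = re.findall(r'[\d\s]', text)    # special sequences inside a set

print(vowels)       # ['e', 'a']
print(all_digits)   # ['6', '6', '3']
print(two_to_four)  # ['3']
print(digit_or_space)
```

Note that the capital 'O' in "Order" is not matched by [aeiou], because character sets are case-sensitive too.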
e. Anchors
Regex also has special anchors that make a pattern match only if it's at the start or end of the string.
We can use the ^ anchor to match a pattern only at the start of the string (or at the start of each line, in multiline mode). For example, ^The matches "The" only when the text begins with it.
Similarly, we can put the $ anchor after a pattern to match it only at the end of the string. For example, end$ matches "end" only when the text finishes with it.
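A minimal sketch of both anchors in Python, with invented sample sentences. The re.MULTILINE flag is included to show how ^ can be made to match at the start of every line rather than only at the start of the string.

```python
import re

# ^ only matches at the very start of the string by default.
at_start = re.findall(r'^The', 'The end of The story')
print(at_start)  # ['The']

# $ only matches at the end of the string.
at_end = re.findall(r'end$', 'The end is not the end')
print(at_end)  # ['end']

# With re.MULTILINE, ^ also matches right after each newline.
first_chars_default = re.findall(r'^\w', 'one\ntwo')
first_chars_multiline = re.findall(r'^\w', 'one\ntwo', flags=re.MULTILINE)
print(first_chars_default)    # ['o']
print(first_chars_multiline)  # ['o', 't']
```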
f. Escaping metacharacters
Consider a case where we want to match the exact text "Mr. Stark".
If we write the regex Mr. Stark, it has an unintended effect, because the dot has a special meaning in a regex: it matches any character, so the pattern would also match text such as "Mrs Stark".
So, we should always escape special metacharacters like . and $ with a backslash if our goal is to match the exact character itself.
Here is the list of metacharacters that you should remember to escape if you're using them directly.
^ $ . * + ? { } [ ] \ | ( )
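The difference between the escaped and unescaped dot can be sketched as follows. The sample text is made up, and re.escape is shown as one convenient option for escaping a literal string automatically.

```python
import re

text = "Mr. Stark and Mrs Stark"  # hypothetical sample text

# Unescaped dot: matches any character, so "Mrs Stark" slips through.
unescaped = re.findall(r'Mr. Stark', text)
print(unescaped)  # ['Mr. Stark', 'Mrs Stark']

# Escaped dot: matches only a literal '.'.
escaped = re.findall(r'Mr\. Stark', text)
print(escaped)  # ['Mr. Stark']

# re.escape builds an escaped pattern from a literal string for us.
auto_escaped = re.findall(re.escape('Mr. Stark'), text)
print(auto_escaped)  # ['Mr. Stark']
```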
Repetition of basic blocks
Now that we can match any single character, we can repeat these blocks to build more complicated patterns.
a. Naive repetition
Using only what we have learned so far, a naive way would be to just repeat the pattern. For example, we can match two-digit numbers by just repeating the character-level pattern.
\d\d
b. Quantifiers
Regex provides special quantifiers to specify different types of repetition for the character preceding it.
i. Fixed repetition
We can use the {...} quantifier to specify the number of times a pattern should repeat.
For example, the previous pattern for matching 2-digit numbers can be rewritten as:
\d{2}
You can also specify a range of repetitions using the same quantifier. For example, to match numbers with 2 to 4 digits, we could use the pattern:
\d{2,4}
When applied to a sentence, it will match both 4-digit and 2-digit numbers.
Note: There should not be any space between the minimum and maximum count. For example, \d{2, 4} doesn't work.
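Here is a small sketch of the fixed-repetition quantifier in Python. The sample sentences are assumptions made for illustration.

```python
import re

# Exactly two digits.
two_digits = re.findall(r'\d{2}', 'I am 25 years old')
print(two_digits)  # ['25']

# Between two and four digits.
sentence = 'The year 2020 and age 25'
two_to_four = re.findall(r'\d{2,4}', sentence)
print(two_to_four)  # ['2020', '25']

# With a space inside the braces, the quantifier is no longer recognised
# and nothing in this sentence matches.
with_space = re.findall(r'\d{2, 4}', sentence)
print(with_space)  # []
```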
ii. Flexible quantifiers
Regex also provides the quantifiers "*", "+" and "?" for specifying flexible repetition of a character.
- 0 or 1 times: ?
  The ? quantifier matches the previous character if it repeats 0 or 1 times. This is useful for making certain parts optional. For example, let's say we want to match both "sound" and "sounds", where the final "s" is optional. Then we can use the pattern sounds? where the ? makes the "s" before it optional.
- 1 or more times: +
  The + quantifier matches the previous character if it repeats 1 or more times. For example, we can find numbers of any length using the regex \d+
- 0 or more times: *
  The * quantifier matches the previous character if it repeats 0 or more times.
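The three flexible quantifiers can be compared side by side with a short sketch like the one below. The sample strings and the go*d pattern are invented for illustration.

```python
import re

# ? : the final 's' is optional.
optional_s = re.findall(r'sounds?', 'one sound, two sounds')
print(optional_s)  # ['sound', 'sounds']

# + : one or more digits, so whole numbers are matched at once.
numbers = re.findall(r'\d+', 'Call 911 or 5551234')
print(numbers)  # ['911', '5551234']

# * : zero or more 'o' characters between 'g' and 'd'.
zero_or_more = re.findall(r'go*d', 'gd god good')
print(zero_or_more)  # ['gd', 'god', 'good']
```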
Usage in Python
Python provides a module called "re" in the standard library to work with regular expressions.
Need for raw strings
To specify a regular expression in Python, we precede it with r to create raw strings.
pattern = r'\d'
To understand why we precede it with r, let's try printing the string '\t' without the r prefix.
>>> pattern = '\t'
>>> print(pattern)
You can see that when we don't use a raw string, \t is treated by Python as the escape sequence for a tab character.
Now let's convert it into a raw string. We get back exactly what we specified.
>>> pattern = r'\t'
>>> print(pattern)
\t
Using re module
To use the re module, we start by importing it:
import re
1. re.findall
This function allows us to get all the matches as a list of strings.
>>> import re
>>> re.findall(r'\d', '123456')
['1', '2', '3', '4', '5', '6']
2. re.match
This function searches for a pattern at the beginning of the string and returns the first occurrence as a match object. If the pattern is not found, it returns None.
>>> import re
>>> match = re.match(r'batman', 'batman is cool')
>>> print(match)
<re.Match object; span=(0, 6), match='batman'>
With the match object, we can get the matched text as:
>>> print(match.group())
batman
In a case where our pattern is not at the start of the sentence, we will not get any match.
>>> import re
>>> match = re.match(r'batman', 'The batman is cool')
>>> print(match)
None
3. re.search
This function also finds the first occurrence of a pattern but the pattern can occur anywhere in the text. If the pattern is not found, it returns None.
>>> import re
>>> match = re.search(r'batman', 'the batman is cool')
>>> print(match.group())
batman
References
- A.M. Kuchling, "Regular Expression HOWTO - Python 3.9.0 documentation"