Amit Chaudhary

Posted on • Originally published at amitness.com

A Visual Guide to Regular Expression

It's a common task in NLP to either check a text against a pattern or extract the parts of the text that match a certain pattern. A regular expression, or "regex", is a powerful tool for this.

While powerful, regex can feel daunting as it comes with a lot of features and sub-parts that you need to remember.

In this post, I will illustrate the various concepts underlying regex. The goal is to help you build a good mental model of how a regex pattern works.

Mental Model

Let's start with a simple example where we are trying to find the word 'cool' in the text.

With regex, we could simply type out the word 'cool' as the pattern and it will match the word.

'cool'

While regex matched our desired word 'cool', the way it operates is not at the word level but the character level. This is the key idea.

Key Idea: Regex works at the character-level, not word-level.

The implication of this is that the regex r'cool' would match the following sentences as well.
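As a minimal sketch of this character-level behaviour, the snippet below runs the pattern against a few sample sentences. The sentences are hypothetical and chosen only for illustration; they are not from the original post.

```python
import re

# Hypothetical sample sentences, chosen only for illustration.
sentences = ["This is so cool", "He is cooler than me", "That was uncool"]

# The pattern is compared character by character, so it is also found
# inside longer words such as "cooler" and "uncool".
matches = [re.findall(r'cool', s) for s in sentences]
print(matches)  # [['cool'], ['cool'], ['cool']]
```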

Basic Building Blocks

Now that we understand the key idea, let's understand how we can match simple characters using regex.

a. Specific character

We can simply specify the character in the regular expression and it will match all instances in the text.

For example, the regular expression given below will match all instances of 'a' in the text. You can use any lowercase or uppercase letter.

'a'

You can also use any digits from 0 to 9 and it will work as well.

'3'

Note that regex is case-sensitive by default, so the following regex won't match a lowercase 'a'.

'A'
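The three patterns above can be tried out in Python as a rough sketch. The sample text is an assumption made for illustration, and the re.IGNORECASE flag in the last line goes slightly beyond what this post covers; it is shown only to contrast with the default case-sensitive behaviour.

```python
import re

text = "a1 b2 c3 aa"  # hypothetical sample text

lower_a = re.findall(r'a', text)   # every lowercase 'a'
digit_3 = re.findall(r'3', text)   # the digit 3
upper_a = re.findall(r'A', text)   # no uppercase 'A' in the text

# Assumption beyond the post: the re.IGNORECASE flag turns off case sensitivity.
any_case_a = re.findall(r'A', text, flags=re.IGNORECASE)

print(lower_a)     # ['a', 'a', 'a']
print(digit_3)     # ['3']
print(upper_a)     # []
print(any_case_a)  # ['a', 'a', 'a']
```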

b. White space character

We can detect special characters such as whitespace and newlines using special escape sequences.

Besides the common ones such as the space, tab (\t) and newline (\n), we have:

  • \r for carriage return
  • \f for form feed
  • \e for escape (supported by some regex flavors such as PCRE, but not by Python's re module).
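Here is a small sketch of matching whitespace characters with Python's re module. The sample strings are assumptions made up for illustration.

```python
import re

text = "name\tage\nAmit\t30"  # hypothetical text with two tabs and one newline

tabs = re.findall(r'\t', text)       # the regex \t matches a tab character
newlines = re.findall(r'\n', text)   # the regex \n matches a newline
spaces = re.findall(r' ', "a b c")   # a plain space matches itself

print(len(tabs), len(newlines), len(spaces))  # 2 1 2
```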

c. Special sequences

Regex provides a bunch of built-in special symbols that can match a group of characters at once. These begin with backslash \.

Pattern: \d

It matches any single digit from 0 to 9.

Notice that each match is a single digit. So a number like 18.04 yields 4 separate matches instead of one match for the whole number.
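A quick way to see this in Python is sketched below. The sample text is an assumption; only the number 18.04 comes from the post.

```python
import re

text = "Ubuntu 18.04"  # hypothetical sentence around the number 18.04

# Each digit is matched separately; the dot is not a digit, so it is skipped.
digits = re.findall(r'\d', text)
print(digits)  # ['1', '8', '0', '4']
```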

Pattern: \s

It matches any whitespace character (such as a space, tab or newline).

Pattern: \w

It matches any lowercase letter (a to z), uppercase letter (A to Z), digit (0 to 9), or underscore.
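The \s and \w sequences can be sketched in the same way. Both sample strings are made up for illustration.

```python
import re

# \s picks out the whitespace characters only.
whitespace = re.findall(r'\s', "a b\tc\nd")
print(whitespace)  # [' ', '\t', '\n']

# \w picks out letters, digits and underscore, but not the space or '!'.
word_chars = re.findall(r'\w', "snake_case 42!")
print(word_chars)
# ['s', 'n', 'a', 'k', 'e', '_', 'c', 'a', 's', 'e', '4', '2']
```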

Pattern: .

It matches any character except the newline (\n).


>>> import re
>>> re.findall(r'.', 'line 1\nline2')
['l', 'i', 'n', 'e', ' ', '1', 'l', 'i', 'n', 'e', '2']

Pattern: Negations

If you use the capitalized versions of the backslash patterns above (\D, \S and \W), they act as negations.

For example, if "\d" matches any digit from 0 to 9, then "\D" matches any character except the digits 0 to 9.
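As a rough illustration, the negated sequences can be compared on one short string. The sample text is hypothetical.

```python
import re

text = "Room 42"  # hypothetical sample text

non_digits = re.findall(r'\D', text)      # everything except digits
non_whitespace = re.findall(r'\S', text)  # everything except whitespace
non_word = re.findall(r'\W', text)        # everything except letters, digits, underscore

print(non_digits)      # ['R', 'o', 'o', 'm', ' ']
print(non_whitespace)  # ['R', 'o', 'o', 'm', '4', '2']
print(non_word)        # [' ']
```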

d. Character sets

These are patterns that start with [ and end with ]; the brackets enclose the characters that should be matched.

For example, the pattern [aeiou] matches any of the characters 'a', 'e', 'i', 'o', and 'u'.

You can also replicate the functionality of \d using the pattern [0123456789]. It will match any digit from 0 to 9.

Instead of specifying all the digits, we can use - to specify only the start and end digits. So, instead of [0123456789], we can write [0-9].

For example, [2-4] can be used to match any digit from 2 to 4, i.e. 2, 3 or 4.

You can even use the special sequences we learned previously inside the brackets. For example, you can match any digit from 0 to 9 or whitespace as [\d\s].
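The character sets described above can be sketched in Python as follows. The sample text is an assumption made for illustration.

```python
import re

text = "regex 2021 is fun"  # hypothetical sample text

vowels = re.findall(r'[aeiou]', text)          # any one of the listed vowels
two_to_four = re.findall(r'[2-4]', text)       # a range: 2, 3 or 4
digit_or_space = re.findall(r'[\d\s]', text)   # special sequences inside a set

print(vowels)          # ['e', 'e', 'i', 'u']
print(two_to_four)     # ['2', '2']
print(digit_or_space)  # [' ', '2', '0', '2', '1', ' ', ' ']
```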

Below, I have listed some useful common patterns and what they mean.

e. Anchors

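Regex also has special anchors that make the pattern match only if it's at the start or end of the string.

We can use the ^ anchor before a pattern to match it only at the start of the string (or at the start of each line, when multi-line mode is enabled). For example, ^a matches an 'a' only if it is the first character.

Similarly, we can use the $ anchor after a pattern to match it only at the end of the string (or of each line in multi-line mode). For example, a$ matches an 'a' only if it is the last character.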
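A minimal sketch of both anchors in Python is given below. The sample strings are assumptions; the re.MULTILINE flag is the standard way to make ^ and $ work per line in Python's re module.

```python
import re

text = "batman is cool"  # hypothetical sample text

starts_with_b = re.findall(r'^b', text)  # 'b' is the first character
starts_with_c = re.findall(r'^c', text)  # 'c' occurs, but not at the start
ends_with_l = re.findall(r'l$', text)    # 'l' is the last character
ends_with_n = re.findall(r'n$', text)    # 'n' occurs, but not at the end

print(starts_with_b, starts_with_c, ends_with_l, ends_with_n)
# ['b'] [] ['l'] []

# By default ^ only matches at the very start of the string.
# With re.MULTILINE it also matches right after each newline.
lines = "one\ntwo"
default_mode = re.findall(r'^\w', lines)
multiline_mode = re.findall(r'^\w', lines, flags=re.MULTILINE)
print(default_mode, multiline_mode)  # ['o'] ['o', 't']
```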

f. Escaping metacharacters

Consider a case where we want to match the exact text "Mr. Stark".

If we write a regex like Mr. Stark, it will have an unintended effect: the dot has a special meaning in regex and matches any character, so the pattern would also match text such as "Mrs Stark".

So, we should always escape special metacharacters like . and $ with a backslash (for example, Mr\. Stark) if our goal is to match the exact character itself.
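The effect of escaping can be sketched as below. The sample text is hypothetical. The last part uses re.escape, a standard-library helper that this post does not cover; it is shown as one possible way to escape a literal string automatically.

```python
import re

text = "Mr. Stark met MrX Stark"  # hypothetical sample text

# Unescaped dot: matches any character, so "MrX Stark" is matched too.
unescaped = re.findall(r'Mr. Stark', text)

# Escaped dot: matches only a literal "."
escaped = re.findall(r'Mr\. Stark', text)

# re.escape builds an escaped pattern from a literal string.
auto_escaped = re.findall(re.escape('Mr. Stark'), text)

print(unescaped)     # ['Mr. Stark', 'MrX Stark']
print(escaped)       # ['Mr. Stark']
print(auto_escaped)  # ['Mr. Stark']
```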

Here is the list of metacharacters that you should remember to escape if you're using them directly.

^ $ . * + ? { } [ ] \ | ( )

Repetition of basic blocks

Now that we can pattern match any characters, we could repeat things and start building more complicated patterns.

a. Naive repetition

Using only what we have learned so far, a naive way would be to just repeat the pattern. For example, we can match two-digit numbers by just repeating the character-level pattern.

\d\d

b. Quantifiers

Regex provides special quantifiers to specify different types of repetition for the character preceding it.

i. Fixed repetition

We can use the {...} quantifier to specify the number of times a pattern should repeat.

For example, the previous pattern for matching 2-digit numbers can be recreated as \d{2}.

You can also specify a range of repetitions using the same quantifier. For example, to match numbers that are 2 to 4 digits long, we could use the pattern \d{2,4}.

When applied to a sentence, it will match both 4-digit and 2-digit numbers.

Note:

There should not be any space between the minimum and maximum count. For example, \d{2, 4} doesn't work.
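The fixed and ranged forms, and the pitfall with the space, can be checked with a short sketch. The sample sentence is an assumption made for illustration.

```python
import re

text = "In 2020, I turned 25 and moved to flat 7"  # hypothetical sample text

# Exactly two digits: "2020" is split into two matches, "7" is too short.
exactly_two = re.findall(r'\d{2}', text)

# Two to four digits: the longest possible run is taken each time.
two_to_four = re.findall(r'\d{2,4}', text)

# With a space inside the braces, Python no longer treats {2, 4} as a
# quantifier, so nothing in this text matches.
with_space = re.findall(r'\d{2, 4}', text)

print(exactly_two)  # ['20', '20', '25']
print(two_to_four)  # ['2020', '25']
print(with_space)   # []
```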

ii. Flexible quantifiers

Regex also provides quantifiers "*", "+" and "?" using which you can specify flexible repetition of a character.

  • 0 or 1 times: ?

    The ? quantifier matches the previous character if it repeats 0 or 1 times. This can be useful to make certain parts optional.

    For example, let's say we want to match both "sound" and "sounds", where the "s" is optional. Then we can use the pattern sounds?, where the ? quantifier matches if the preceding character repeats 0 or 1 times.

  • one or more times: +

    The + quantifier matches the previous character if it repeats 1 or more times.

    For example, we could find numbers of any arbitrary length using the regex \d+.

  • zero or more times: *

    The * quantifier matches the previous character if it repeats zero or more times.
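The three flexible quantifiers above can be sketched side by side. All sample strings and the go*d pattern are assumptions made up for illustration; only \d+ is taken from the text.

```python
import re

# ? : the trailing "s" may appear 0 or 1 times.
optional_s = re.findall(r'sounds?', 'one sound, many sounds')

# + : one or more digits in a row form a single match.
numbers = re.findall(r'\d+', 'Call 911 or 5551234')

# * : the "o" may appear zero or more times between "g" and "d".
zero_or_more = re.findall(r'go*d', 'gd god good')

print(optional_s)    # ['sound', 'sounds']
print(numbers)       # ['911', '5551234']
print(zero_or_more)  # ['gd', 'god', 'good']
```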

Usage in Python

Python provides a module called "re" in the standard library to work with regular expressions.

Need for raw strings

To specify a regular expression in Python, we precede it with r to create raw strings.

pattern = r'\d'

To understand why we precede it with r, let's try printing the expression \t without r.

>>> pattern = '\t'
>>> print(pattern)


You can see that when we don't use a raw string, Python treats \t as the escape sequence for a tab.

Now let's convert it into a raw string. We get back whatever we specified.

>>> pattern = r'\t'
>>> print(pattern)
\t
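Another way to see the difference is to compare the two strings directly, as in the sketch below. It only uses built-in string behaviour and re.findall; the variable names are arbitrary.

```python
import re

plain = '\t'   # Python turns this into a single tab character
raw = r'\t'    # raw string: a backslash followed by the letter t

print(len(plain), len(raw))  # 1 2
print(raw == '\\t')          # True: same as doubling the backslash by hand

# For regex patterns such as \d, the raw string avoids the doubled backslash.
same_result = re.findall(r'\d', 'a1b2') == re.findall('\\d', 'a1b2')
print(same_result)           # True
```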

Using re module

To use the re module, we start by importing it:

import re

1. re.findall

This function allows us to get all the matches as a list of strings.

>>> import re
>>> re.findall(r'\d', '123456')
['1', '2', '3', '4', '5', '6']

2. re.match

This function searches for a pattern at the beginning of the string and returns the first occurrence as a match object. If the pattern is not found, it returns None.

>>> import re
>>> match = re.match(r'batman', 'batman is cool')
>>> print(match)
<re.Match object; span=(0, 6), match='batman'>

With the match object, we can get the matched text as

>>> print(match.group())
batman

In a case where our pattern is not at the start of the sentence, we will not get any match.

>>> import re
>>> match = re.match(r'batman', 'The batman is cool')
>>> print(match)
None

3. re.search

This function also finds the first occurrence of a pattern but the pattern can occur anywhere in the text. If the pattern is not found, it returns None.

>>> import re
>>> match = re.search(r'batman', 'the batman is cool')
>>> print(match.group())
batman
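Since both re.match and re.search return None when nothing is found, calling .group() directly on the result can fail. The sketch below contrasts the two functions and guards against None. The helper name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: return the first matched text, or None."""
    m = re.search(pattern, text)
    return m.group() if m is not None else None

sentence = 'The batman is cool'

at_start = re.match(r'batman', sentence)   # None: not at the beginning
anywhere = re.search(r'batman', sentence)  # found later in the string

print(at_start)                             # None
print(anywhere.span())                      # (4, 10)
print(first_match(r'batman', sentence))     # batman
print(first_match(r'superman', sentence))   # None
```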

Connect

If you enjoyed the blog post, feel free to connect with me on Twitter where I share new blog posts every week.

Top comments (28)

EECOLOR

Good write! When I explain regular expressions I tend to mention the following points:

  • Once you learn regular expressions, please remember that you barely ever need them. Things tend to look like regex nails ;-)
  • If your regular expression starts to look back at you in an intimidating fashion, you are probably better off using something else.
  • You should not use a regular expression for things that involve nesting (code, html, ...)

I also point people towards RegExr, especially the 'Tests' tab is amazing, make sure you also write tests for the things you do not want to match.

Amit Chaudhary

Great to know. I'll update the post with a link to Regexer.

Doug

The greatest danger using regular expressions is a global search and replace. I can't tell you how many times I've seen inadvertent changes cause down stream problems.

Adam Coster

This is a super clear guide for getting started with regex! Nicely done.

Mark Roeling • Edited

Well done! Next step: groups and backreferences :P

Addition for this writing:
? equals {0,1}
* equals {0,}
+ equals {1,}

All info with similar visualisations on regular-expressions.info/ .

Amit Chaudhary

Didn't know about this. It looks super helpful. Thank you for sharing.

q2apro

Great tutorial. Also the highlighting of found matches is extremely helpful. 👍

I have learned Regex with regex101.com/ - I cannot recommend this tool enough. Helped me to master Regex. Before I was a complete noob and could not understand a single expression.

Doug

I wrote a simple Windows forms utility where I can test REs. One text box for the RE I'm testing, one where I start typing in text and the utility responds with an indication of a match as I type each character.

Michael Baas

Duh - how embarrassing is that: I've used regex for years - and only now have I learned about \D as a negation of \d! THANKS! (Can't wait to read in full to see what else I missed...)

Khaled Monsoor • Edited

hey @amitness cool guide for difficult stuff like RegEx. 👏

I think you got a typo in the 2nd last example.
Why should r'batman' in

match = re.match(r'batman', 'The batman is cool')

match only an occurrence at the beginning?

Anna R Dunster

re.match() and re.search() work differently: stackoverflow.com/questions/180986...

Thijs Brobbel

Awesome how you created this guide. It's so pretty!! How do you make these diagrams?

Amit Chaudhary

Thank you.

I created the diagrams using excalidraw.com

dotorimook

Good post. Nice to read.
Could you go further explaining capturing group with ()?

Josias Aurel

Nice post on Regex on python. Your explanations are so good. Awesome.

AucaCoyan

This is a really cool reference, thanks!