Recently I've been doing quite a lot of work using Regex and to be honest, I was pretty shite at it. It's one of those things that has been on my to-learn list for years, but I've never really got round to it because, like most people, I google a regex problem and see how other people have solved it rather than working it out myself.
What is regex
So if you don't know what regex is, don't worry: this blog should be suitable for devs at all levels! Regex is short for Regular Expression, a sequence of characters that specifies a search pattern. We use it reasonably frequently in programming to search for patterns in strings, either to pull information out or to make changes to it.
Let's get started
In this blog I recommend you open up https://regexr.com/, an online tool for evaluating regex that lets you put in test strings to see if your regex is working. Follow along with me, making sure to use your own test strings to really solidify your understanding.
Slashes
The first thing you'll see is the slashes, one at the start and one at the end. They simply mark where a regex starts and where it ends.
Flags
After the end slash you may have noticed there is usually a letter. These are called flags. The 6 you're most likely to come across are:
- g - global search, meaning it'll match all occurrences, this is the most common and will be used for the rest of the blog
- i - ignore case, meaning case-insensitive
- m - multiline mode, which will make more sense when we cover the ^ and $ below.
u, y and s are the others, but they won't be covered in this blog as I've never really seen them used, so I wouldn't consider them a learning priority. If you are interested, you can learn more here.
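If you'd like to see the flags outside of RegExr, here's a minimal sketch in JavaScript. The test string and variable names are my own made-up examples, so swap in whatever you like.

```javascript
// Hypothetical test string, not taken from the article.
const flagText = "Panic, panic, Panic!";

// No flags: match() stops at the first occurrence.
const firstOnly = flagText.match(/Panic/);

// g: return every occurrence (still case-sensitive).
const allExact = flagText.match(/Panic/g);

// g and i together: every occurrence, ignoring case.
const allAnyCase = flagText.match(/panic/gi);

console.log(firstOnly[0]); // "Panic"
console.log(allExact);     // ["Panic", "Panic"]
console.log(allAnyCase);   // ["Panic", "panic", "Panic"]
```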
Exact matching
Let's begin with exact character matching. Say we want to match on the word Panic: all we have to do is write the thing.
/Panic/g
Which is probably what you want most of the time, and if it is, you likely don't need regex. But, anything a little more involved will require quantifiers, which we'll get onto.
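As a rough illustration of the "you likely don't need regex" point, here's a small JavaScript sketch comparing an exact-match regex with a plain string search. The input string is a made-up example.

```javascript
// Hypothetical input string.
const sign = "Don't Panic";

// Exact matching with a regex...
const foundWithRegex = /Panic/.test(sign);

// ...and the same check without any regex at all.
const foundWithIncludes = sign.includes("Panic");

console.log(foundWithRegex, foundWithIncludes); // true true
```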
Quantifiers
The +
The plus character + means match one or more of the preceding token. So in our example below, match the letter l one or more times.
/l+/g
The ?
The question mark ? is the optional symbol: it makes the token right before it optional. So in our example, we are searching for i, and f if it exists, but if f doesn't exist don't worry yourself, we'll take the i. The optional-ness only applies to the first token that precedes the ?, which is f.
/if?/g
The *
The * is basically a combination of the + and the ?. It matches 0 or more of the preceding token. So it's optional, and if there are multiple of the preceding token we'll take those too. So in our example, we will match on anything that has an e followed by zero or more ls.
/el*/g
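Here's a quick JavaScript sketch of the three quantifiers side by side. The test strings are hypothetical ones I picked so each pattern gets a few different matches.

```javascript
// + : one or more of the preceding token
const plusMatches = "hello all".match(/l+/g);

// ? : the preceding token (f) is optional
const optionalMatches = "if it is".match(/if?/g);

// * : zero or more of the preceding token
const starMatches = "hello eel".match(/el*/g);

console.log(plusMatches);     // ["ll", "ll"]
console.log(optionalMatches); // ["if", "i", "i"]
console.log(starMatches);     // ["ell", "e", "el"]
```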
The .
The dot matches any character except line breaks. In the example below, we are looking for any character that has a space after it (the space is part of the match too).
/. /g
The \w
So you might look at this and think oh god, this is why regexes are so complicated, what's the bloody backslash for? Well, if we just put in w then it'd search for the letter w, just like the exact matching above, so the backslash lets us search for any word character instead. That's any alphanumeric character or underscore.
Our example matches on any word character then the letter s:
/\ws/g
The \s
The \s character class is very simple: it just matches on any whitespace character (tabs, spaces, line breaks):
/\s/g
The \S
Matches anything that isn't whitespace:
/\S/g
This is true for other character classes too: when they're capitalised it means the opposite. So for \d, which matches on digits, there is \D, which matches on anything that isn't a digit, and \W matches on anything that isn't a word character.
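To make the dot and the backslash classes a bit more concrete, here's a small JavaScript sketch. All of the test strings are hypothetical.

```javascript
// . followed by a space: any character, then a literal space
const dotMatches = "a b c".match(/. /g);

// \w then s: any word character followed by the letter s
const wordMatches = "cats is".match(/\ws/g);

// \s and \S: whitespace and everything that isn't whitespace
const whitespace = "a b\tc".match(/\s/g);
const notWhitespace = "a b\tc".match(/\S/g);

// \d and \D: digits and everything that isn't a digit
const digits = "Room 42".match(/\d/g);
const notDigits = "Room 42".match(/\D/g);

console.log(dotMatches);        // ["a ", "b "]
console.log(wordMatches);       // ["ts", "is"]
console.log(whitespace.length); // 2 (one space, one tab)
console.log(notWhitespace);     // ["a", "b", "c"]
console.log(digits);            // ["4", "2"]
console.log(notDigits.length);  // 5 (R, o, o, m and the space)
```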
The |
The pipe means or, so search for this or that. So if you want to search for some literal text, let's say "happy", but also want to grab all the occurrences of the word "day", then you can just put those either side of a pipe and you'll match on both!
/happy|day/g
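A minimal JavaScript sketch of the pipe, using a made-up sentence:

```javascript
// Hypothetical test string.
const mood = "happy day, happy days";

// Match either the literal text "happy" or the literal text "day".
const eitherWord = mood.match(/happy|day/g);

console.log(eitherWord); // ["happy", "day", "happy", "day"]
```

Notice that "days" still gives us a "day" match, because the pattern doesn't care what comes after it.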
The ^
This symbol denotes the start of the string or, if the multiline flag is enabled, the start of each line.
Example without the multiline flag:
/^the/g
Notice how it doesn't grab the "the" before Galaxy as it is not at the start of the string, and also how it doesn't grab the "the" on the second line, because that's also not the start of the string. However, if we enable multiline:
/^the/gm
Then a match is made at the start of both lines.
The $
This denotes the end of the string or, with multiline enabled, the end of each line. In the example below I'm not using multiline:
/Thursday$/g
Notice how it doesn't match on the other Thursday as that's not at the end of the string, if you got rid of the $ then it certainly would match!
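Here's a small JavaScript sketch of ^ and $ with and without the m flag. The two-line strings are hypothetical stand-ins for whatever you've pasted into RegExr.

```javascript
// Hypothetical two-line test strings.
const guide = "the Guide to the Galaxy\nthe answer is 42";
const days = "This must be Thursday\nI never could get the hang of Thursday";

// Without m, ^ only matches at the very start of the string.
const startOfString = guide.match(/^the/g);
// With m, ^ matches at the start of every line.
const startOfLines = guide.match(/^the/gm);

// Without m, $ only matches at the very end of the string.
const endOfString = days.match(/Thursday$/g);
// With m, $ matches at the end of every line.
const endOfLines = days.match(/Thursday$/gm);

console.log(startOfString.length); // 1
console.log(startOfLines.length);  // 2
console.log(endOfString.length);   // 1
console.log(endOfLines.length);    // 2
```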
Length matching
Specifying a length in regex is pretty easy: choose your desired token, follow it with some curly braces and pop a number in them. The example below matches exactly 8 word characters in a row:
/\w{8}/g
You can also specify a min and max range:
/\w{4,6}/g
So that's any run of word characters that is at least 4 long and at most 6.
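One thing worth seeing for yourself is that a length match will happily grab part of a longer word. A rough JavaScript sketch, with hypothetical test strings:

```javascript
// Exactly 8 word characters: "Hitchhikers" is 11 long, so we get its first 8.
const exactlyEight = "Hitchhikers guide".match(/\w{8}/g);

// Between 4 and 6 word characters: "to" is too short, "quickly" is cut off at 6.
const fourToSix = "to panic quickly".match(/\w{4,6}/g);

console.log(exactlyEight); // ["Hitchhik"]
console.log(fourToSix);    // ["panic", "quickl"]
```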
Character sets
Character sets allow you to plonk stuff in square brackets which basically says match on anything that is within these square brackets. So the following means match any lowercase vowels:
/[aeiou]/g
And this means match any vowels that are followed by the letter r:
/[aeiou]r/g
You can also add ranges in character sets, the following means match on any lowercase characters in the alphabet:
/[a-z]/g
We want to grab the capital letters too, let's add that range as well.
/[a-zA-Z]/g
Another really common range you might see in regexes is [0-9], which matches any digit.
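Here are the character sets from this section as a quick JavaScript sketch. The test strings are made up.

```javascript
// Any lowercase vowel
const vowels = "regular".match(/[aeiou]/g);

// Any lowercase vowel followed by the letter r
const vowelThenR = "regular".match(/[aeiou]r/g);

// Ranges: lowercase only, both cases, and digits
const lowercase = "Regex 101".match(/[a-z]/g);
const anyLetter = "Regex 101".match(/[a-zA-Z]/g);
const anyDigit = "Regex 101".match(/[0-9]/g);

console.log(vowels);     // ["e", "u", "a"]
console.log(vowelThenR); // ["ar"]
console.log(lowercase);  // ["e", "g", "e", "x"]
console.log(anyLetter);  // ["R", "e", "g", "e", "x"]
console.log(anyDigit);   // ["1", "0", "1"]
```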
Groups
Putting parentheses around part of your regex allows you to group it. This basically means that you can apply a quantifier to the entire group rather than just the preceding token, or keep a | from applying to the whole pattern.
The following finds "the" with an upper or lowercase t.
/(t|T)he/g
Or take the following example, which looks for the letter p twice in the group and then an e after it. Notice the question mark after the group: it makes the preceding token optional, and here the preceding token is the whole group, so everything within it is optional. That means a lone e still matches when there's no pp before it.
/(pp)?e/g
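A small JavaScript sketch of both group examples, using test strings I've made up:

```javascript
// (t|T)he: the group limits the "or" to just the first letter.
const theMatches = "The other thing".match(/(t|T)he/g);

// (pp)?e: the whole group is optional, so a lone e matches too.
const ppeMatches = "pepper".match(/(pp)?e/g);

console.log(theMatches); // ["The", "the"]
console.log(ppeMatches); // ["e", "ppe"]
```

The "the" in the first result comes from the middle of "other", and "thing" gets skipped because there's no e after the th.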
Referencing Groups
In some scenarios, you may want to split your regex up into groups so that you can later reference them individually. Let's take the following example:
/(\d)\. (.+)/g
This regex, first of all, looks for a single digit, which is our first group, then a . followed by a space (notice we had to escape the . else it would match on any character, as shown previously). The second group (.+) is looking for one or more of any character except line breaks.
Using the replace tools on https://regexr.com/ you can easily reference your groups. To reference the first group type in $1:
And then to reference the second group type in $2
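The same $1 and $2 references work in JavaScript's replace(), so here's a minimal sketch if you'd rather try it in code. The numbered list is a hypothetical stand-in for the test text, and the replacement strings are just my own examples.

```javascript
// Hypothetical numbered list.
const books = "1. Dirk Gently\n2. Mostly Harmless";
const numbered = /(\d)\. (.+)/g;

// $1 is whatever the first group captured (the digit).
const numbersOnly = books.replace(numbered, "$1");

// $2 is whatever the second group captured (the rest of the line).
const titlesOnly = books.replace(numbered, "$2");

// You can mix both references with literal text.
const swapped = books.replace(numbered, "$2 (#$1)");

console.log(numbersOnly); // "1\n2"
console.log(titlesOnly);  // "Dirk Gently\nMostly Harmless"
console.log(swapped);     // "Dirk Gently (#1)\nMostly Harmless (#2)"
```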
Non-capturing groups
Let's say you want to capture the name of the book but don't really care about the number. So when you reference the group by using $1 it'll just return you the name. To do this we can create non-capturing groups: within your brackets you can just plonk a ?: at the start to say, I still want to match this, but I don't want it captured as a numbered group.
/(?:\d)\. (.+)/g
It looks the same right? But not when outputting the groups:
Notice how $2 is now black and is treated as though you want to append $2 as a string on the end because it's not defined.
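Here's roughly how that plays out in JavaScript's replace(), as a sketch with a hypothetical book list. With the digit in a non-capturing group, the title becomes group 1 and there is no group 2 any more.

```javascript
// Hypothetical numbered list.
const bookList = "1. Dirk Gently\n2. Mostly Harmless";
const titleOnlyPattern = /(?:\d)\. (.+)/g;

// The only capturing group left is the title, so it is now $1.
const titles = bookList.replace(titleOnlyPattern, "$1");

// There is no second group, so "$2" is left in the output as plain text.
const noSecondGroup = bookList.replace(titleOnlyPattern, "$2");

console.log(titles);        // "Dirk Gently\nMostly Harmless"
console.log(noSecondGroup); // "$2\n$2"
```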
Lookaheads/ lookbehinds
Okay, so you've learned pretty much everything you'll need to know about regex in 95% of scenarios, let's learn about lookaheads and lookbehinds.
Positive Lookbehind ?<=
Matches a group before an expression without including it in the result.
Carrying on from the previous example let's take the same input and use positive lookbehind to find all occurrences of the letter h that are preceded by the letter t.
/(?<=t)h/g
So positive lookbehind is great for looking for something that precedes a pattern but you don't really care about having that pattern in the result.
Negative Lookbehind ?<!
Specifies a group that cannot match before the main expression, so this is basically the inverse of positive lookbehind. All we do is swap the = for a !.
Let's do that for the example we used in positive lookbehind:
/(?<!t)h/g
So this finds all occurrences of the letter h that are not preceded by the letter t.
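To see which h each lookbehind picks, here's a small JavaScript sketch that uppercases the matches. The test string is a made-up one, and this assumes a JavaScript engine that supports lookbehind (current browsers and Node do).

```javascript
// Hypothetical test string.
const phrase = "the hitchhiker";

// Positive lookbehind: only an h that has a t right before it.
const afterT = phrase.replace(/(?<=t)h/g, "H");

// Negative lookbehind: only an h that does NOT have a t right before it.
const notAfterT = phrase.replace(/(?<!t)h/g, "H");

console.log(afterT);    // "tHe hitchhiker"
console.log(notAfterT); // "the HitcHHiker"
```

In both cases only the h is replaced. The t is checked but never becomes part of the match.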
Positive Lookahead ?=
Matches a group after an expression without including it in the result. So it works just like positive lookbehind, except it checks what comes after the main expression rather than what comes before it.
In the example below I'm looking for any character that has an a following it:
/.(?=a)/g
Negative Lookahead ?!
Is just the inverse of positive lookahead. Specifies a group that cannot match after the main expression.
So the following looks for any character that is not followed by the letter a:
/.(?!a)/g
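And the two lookaheads as a quick JavaScript sketch, using "banana" as a hypothetical test string:

```javascript
// Hypothetical test string.
const fruit = "banana";

// Positive lookahead: any character that has an a right after it.
const beforeA = fruit.match(/.(?=a)/g);

// Negative lookahead: any character that does NOT have an a right after it.
const notBeforeA = fruit.match(/.(?!a)/g);

console.log(beforeA);    // ["b", "n", "n"]
console.log(notBeforeA); // ["a", "a", "a"]
```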
Let's tackle a problem together
A really common regex problem is phone numbers. Copy the following text into RegExr.
L. Garrigan: 248-555-1234
J. Bezos: 232 123 2312
E. Musk: (313) 555-1234
B. Gates: (810)555-1234
Goal: get the numbers for each person and output in the format xxx-xxx-xxxx
Let's pick the problem apart. L. Garrigan's number is already in the correct format, so let's grab that one first.
/\d{3}-\d{3}-\d{4}/g
So we're looking for 3 digits followed by a -, then another 3 digits followed by a -, then 4 digits.
Next let's try to grab the phone number with spaces rather than dashes.
/\d{3}-?\s?\d{3}-?\s?\d{4}/g
So I've made the - optional and also added an optional space denoted by \s?
I'm pretty sure this can be improved though using character sets rather than two optionals:
/\d{3}[- ]\d{3}[- ]\d{4}/g
I decided to go for [- ], which means match either - or a space. I chose a literal space instead of \s because \s also matches on tabs and line breaks, which I don't need.
Let's go for the third row now with the first numbers wrapped in brackets:
/\(?\d{3}\)?[- ]\d{3}[- ]\d{4}/g
I've just added an optional opening bracket \(? at the start (notice it's escaped, else it'd think we were creating a group) and an optional closing bracket \)? after the first 3 digits.
Next, let's go for Gates, who helpfully decided not to put a space after his parenthesis.
/\(?\d{3}\)?[- ]?\d{3}[- ]\d{4}/g
So I just made the first character set optional, which allows us to grab that one too.
We need to group these so we can output all the information in the desired format. So I'm going to group the first 3 digits, then the second 3, then the last 4.
So it is just a case of wrapping parentheses around the digit tokens:
/\(?(\d{3})\)?[- ]?(\d{3})[- ](\d{4})/g
And then output should look like the following:
Wonderful, all in that lovely format we wanted.
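If you want to run the whole thing outside of RegExr, here's a minimal JavaScript sketch. The replacement string "$1-$2-$3" and the variable names are my own assumptions about how you'd wire it up, and I've used a plain space in both separator sets, following the reasoning earlier about not needing tabs or line breaks.

```javascript
// The four sample rows from this section.
const contacts = [
  "L. Garrigan: 248-555-1234",
  "J. Bezos: 232 123 2312",
  "E. Musk: (313) 555-1234",
  "B. Gates: (810)555-1234",
].join("\n");

// Three capturing groups: first 3 digits, second 3 digits, last 4 digits.
const phonePattern = /\(?(\d{3})\)?[- ]?(\d{3})[- ](\d{4})/g;

// Rebuild every number as xxx-xxx-xxxx from the three groups.
const normalised = contacts.replace(phonePattern, "$1-$2-$3");

console.log(normalised);
// L. Garrigan: 248-555-1234
// J. Bezos: 232-123-2312
// E. Musk: 313-555-1234
// B. Gates: 810-555-1234
```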
Note there are so many edge cases here. I'd recommend checking out this article, which covers the problem in more detail.
Tackle some problems!
The only real way to get decent at anything is by practicing, so to really solidify what you've learnt in this blog, have a crack at some problems on here. It's a really good website that'll take you through some regex problems dependent upon your ability. It takes into consideration the quantifiers you know and don't know and designs tests accordingly.
I hope you've enjoyed this blog, be sure to check out my personal blogging site at codeheir.com for some extra free content!