Interestingly, CloudFlare went down for a while yesterday. The post-mortem explains that the whole situation was due to a regex deployed globally to the WAF, which caused a CPU overload.
How come? Well, we don't know, because the regexp isn't publicly listed; however, the guess is easy to make once you know that regexps have a huge drawback.
Note — This is just a guess!
One typical hard case for regex engines is a?a?a?aaa. For 3 a's, like here, it's easy, but by the time you reach 20 it's already very complicated. See the following Python code for an example.
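The original snippet didn't survive on this page, so here's a minimal reconstruction of what such a benchmark could look like — assuming timeit and a single match per pattern; the exact original code may differ:

```python
import re
import timeit

# Patterns of the form a?a?...a?aa...a matched against "a" * n:
# the optional parts can match the a's in many different ways, and
# a backtracking engine tries them all before succeeding.
for n in (1, 5, 10, 15):
    pattern = re.compile("a?" * n + "a" * n)
    text = "a" * n
    elapsed = timeit.timeit(lambda: pattern.match(text), number=1)
    print(f"{pattern!r} = {elapsed}")
```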
The output I get is
re.compile('a?a') = 8.5528998170048e-05
re.compile('a?a?a?a?a?aaaaa') = 0.0009152740021818317
re.compile('a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa') = 0.009123567000642652
re.compile('a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaa') = 0.4090354809995915
The graph (stolen from the article I'm going to link later on in this post) looks like this:
As you can see, the growth is absolutely not proportional: it becomes slow beyond a dozen question marks and completely crazy around twenty.
So what's happening here? Since the regex engine is backtracking, it creates a different branch every time you have a question mark. It means that it will explore all the different ways the string could match. With n question marks, that's up to 2^n combinations to try: the complexity is exponential!
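You can see the branching directly with a toy backtracking matcher that counts its own recursive calls — a sketch of the strategy, not how any real engine is implemented:

```python
def backtrack_match(tokens, text, counter):
    """Naive backtracking matcher; tokens is a list of (char, optional)."""
    counter[0] += 1
    if not tokens:
        return not text
    ch, optional = tokens[0]
    # greedy: try to consume a character first, the way `a?` behaves...
    if text and text[0] == ch and backtrack_match(tokens[1:], text[1:], counter):
        return True
    # ...then fall back to skipping the token if it's optional
    if optional and backtrack_match(tokens[1:], text, counter):
        return True
    return False

def count_calls(n):
    # a?a?...a?aa...a against "a" * n
    tokens = [("a", True)] * n + [("a", False)] * n
    counter = [0]
    assert backtrack_match(tokens, "a" * n, counter)
    return counter[0]

for n in (2, 6, 10, 14):
    # the call count roughly doubles with each extra question mark
    print(n, count_calls(n))
```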
The good news is that you don't have to live with slow regular expressions. There's nothing wrong with the way you write them; it's rather a question of the algorithm behind them. You might have heard of Ken Thompson.
He invented a way to run regular expressions now dubbed the Thompson NFA. Long story short, it turns the seconds seen above into microseconds!
Is it magic?
It's simply a very smart technique that slashes the complexity down to something linear.
How does it work?
It would take a while to explain, and a very good article on the matter already exists, so I'm not going to plagiarize it.
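Still, the core trick can be sketched in a few lines: instead of backtracking, keep the whole set of NFA states alive while scanning the text once. Here's a toy version supporting only literal characters and `?` — already enough for the pathological pattern above. This is an illustration of the state-set idea, not a full Thompson construction:

```python
def nfa_match(pattern, text):
    """Match a pattern of literals and '?' with Thompson-style state-set
    simulation: time is O(len(pattern) * len(text)), never exponential."""
    # tokenize: each token is (char, optional)
    tokens = []
    i = 0
    while i < len(pattern):
        optional = i + 1 < len(pattern) and pattern[i + 1] == "?"
        tokens.append((pattern[i], optional))
        i += 2 if optional else 1

    def closure(states):
        # an optional token may be skipped, so state i also reaches i + 1
        result = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            if s < len(tokens) and tokens[s][1] and s + 1 not in result:
                result.add(s + 1)
                stack.append(s + 1)
        return result

    states = closure({0})
    for c in text:
        # advance every live state at once instead of branching
        states = closure({s + 1 for s in states
                          if s < len(tokens) and tokens[s][0] == c})
        if not states:
            return False
    return len(tokens) in states

# The pattern that chokes a backtracking engine is matched instantly:
n = 25
print(nfa_match("a?" * n + "a" * n, "a" * n))  # → True
```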
Why isn't it used today?
Because some features of modern regular-expression engines forbid it. But most regular expressions don't use those features, so modern engines could ship two engines, including a Thompson-NFA-based one used whenever possible, to avoid the explosive complexity. Again, see the article.
Can I use it today?
Sure, there are plenty of implementations out there. However, none is mainstream to my knowledge. There are ports for Python or JavaScript, for example.
Should I use it today?
It depends:
- Do you do something fancy?
- Is it performance-sensitive?
- How much control do you have over the expressions that are inserted?
If you're CloudFlare, your regexps run on zillions of HTTP queries, they look for super-weird security patterns, and they're created by dozens of engineers, then you should probably use it.
If you're validating an email in a form, you should probably not care.
Conclusion
Regular expression engines are flawed because their complexity can go completely through the roof. As CloudFlare showed, the repercussions can be dramatic.
The good news is that there are alternative engines that don't suffer from this kind of issue. On the other hand, they are not mainstream, so integrating one is a bit complicated and increases the maintenance surface.
It means that you have to know about the flaw and, every time you use a regexp, choose the proper engine for it. Then you might avoid a lot of drama!
Top comments (5)
Great article Remy! It's absolutely absurd how slow regex is! 😅
e.g. Even the simplest of matches are left in the dust. jsperf.com/slice-vs-startswith-vs-...
Thanks!
One remark on your benchmark though: I think the compile time of the regexes is not excluded from the timing. Excluding it would be fairer since, done correctly, compilation only impacts the first execution of your code.
I'm saying that but I'm totally unsure of how JS handles this :)
Oh that's a really good point! I'll check that out.
So I added a few new tests:
*The RegExp instance is declared outside the test loop.
I'm seeing that #1 is significantly faster than the String.match() test where the regular expression is provided inline. The difference seems to be (very roughly) in the magnitude of 4,769,893 ops/sec.
The difference between #2 and #3 is significantly smaller and in favor of the inline test. The (very) rough difference seems to be around 121,562 ops/sec.
So, perhaps the String.match() method has some optimizations for handling inline regular expressions that aren't bogged down by the usage of an explicit RegExp instance.
jsperf.com/slice-vs-startswith-vs-...
You're cruel!