Split Blog

Originally published at split.io

How to Compute the Confusion Matrix of Your Tests After Running One Hundred Experiments

After running about 100 experiments, you will have gone through a lot of conversations with stakeholders. One likely topic is whether the criteria for accepting a change are too strict. Your teams have put a lot of heart into their service improvements, and yet, for many of them, the impact isn’t as positive as expected. When the impact looks positive, it might still not pass the significance threshold, even after a long wait. When changes end up not being significant, the most common recommendation is to ask the team to rethink their implementation. That disappointment will stress out your stakeholders.

With more than one team testing, product managers will share their disappointment with each other. They will bond over how hard the criteria seem. You will need clear answers to justify the significance threshold: does it have to be 5%?

You might have decided that 5% is too tight and raised the threshold for the false-positive rate under the null hypothesis, a.k.a. the p-value threshold, to 10%. You might have decided that only improvements matter and picked a one-tailed test. You are not sure that this was a good idea. Is there a simple way to explain why we test for significance? How can you tell what threshold is appropriate for your organization? With enough tests in your dataset, it’s time to draw a confusion matrix to decide what to do with that seemingly abstract number.

To decide whether to make the criteria more or less strict, you can look at your last 100 experiments. Most likely, about 20 to 30 of them were deemed significant. Let’s say 30.

Gradually Estimating the Confusion Matrix

Ideally, that would mean that this is happening: all 30 significant results correspond to genuine improvements, and all 70 non-significant results correspond to changes with no real effect.

But that’s unlikely: with a 10% false-positive rate, about (70*10% =) 7 of the experiments marked as significant were not genuine improvements. We can’t change what we observed, so we preserve the totals per column and transfer those presumed 7 results from one row to the other. Therefore, 7 of the 30 seemingly significant results were presumably not real, and about 23 are reliable; for now, the 70 non-significant results are all assumed to be true negatives.

In addition to that, one would typically set their experiment power, i.e., the probability of detecting the minimum detectable effect (one minus the false-negative rate), to at least 80%.

That means that, to have 23 genuine improvements detected as such, you need about (23/80% ≈) 28 actual good ideas, (28-23 =) 5 of which failed to reach significance. The estimated matrix therefore has 23 true positives, 7 false positives, 5 false negatives, and 65 true negatives.

Overall, this means that you probably had five tests that were good ideas but ended up among the 70 ignored because our testing process is imperfect: about 7% of the non-significant results were actually worth shipping.

If your confusion matrix looks like that, (5+7 =) 12 false results out of 100 might look alarming, but 88% correct identification is excellent!
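If it helps to see the arithmetic spelled out, here is a minimal Python sketch of the estimation above. The function name, parameter names, and rounding choices are mine for illustration; this is not Split’s tooling, just the back-of-the-envelope logic from this section.

```python
# A rough sketch of the estimation above: given how many experiments came out
# significant, plus the false-positive rate (alpha) and power you test with,
# estimate the confusion matrix. Names and rounding are illustrative only.

def estimate_confusion_matrix(n_experiments, n_significant, alpha=0.10, power=0.80):
    n_not_significant = n_experiments - n_significant

    # Step 1: treat the non-significant tests as (roughly) the true nulls;
    # a share alpha of them would still have shown up as false positives.
    false_positives = alpha * n_not_significant
    true_positives = n_significant - false_positives

    # Step 2: with the given power, the true positives we detected imply a
    # larger pool of genuinely good ideas, some of which were missed.
    actual_good_ideas = true_positives / power
    false_negatives = actual_good_ideas - true_positives
    true_negatives = n_not_significant - false_negatives

    return {
        "true positives": round(true_positives),
        "false positives": round(false_positives),
        "false negatives": round(false_negatives),
        "true negatives": round(true_negatives),
    }

# The worked example above: 30 significant results out of 100 experiments,
# with a 10% false-positive rate and 80% power.
print(estimate_confusion_matrix(100, 30))
# {'true positives': 23, 'false positives': 7, 'false negatives': 6, 'true negatives': 64}
# (rounding 23 / 80% down to 28 ideas, as the text does, gives 5 and 65 instead)
```

The same helper is reused in the scenarios below; only the number of significant tests changes.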

All your teams are hard at work trying new releases, and it will probably hurt to see the testing process reject so many of what they consider excellent ideas. However, this balance is a reasonable compromise between the speed of innovation and filtering out changes that don’t actually deliver.

A Pedagogical Tool

More importantly, this computation allows you to counter the false narrative that using a p-value of 5% means that “if the test result is significantly positive, there’s a 5% chance that it is not.” Of course, you can explain that this is not what the p-value means (the p-value is the chance that, if we A/B test a change with no actual impact, we still consider it significant). However, it’s a lot harder without a clear answer to “What is the chance that my change corresponds to a positive effect?” and without a clear explanation of how 5% relates to that number.

The chance that a change flagged as significant isn’t actually an improvement is known as the false discovery rate (FDR). It is rarely the same as the false-positive rate (FPR), which assumes the null hypothesis.

Rather than use long words and similar-sounding acronyms, show the detail of that computation. With the matrix, you can quickly compute 7 / (7+23) = 23% and start the discussion: nearly a quarter of the results marked as positive are probably not real improvements. Do your organization’s experimental practices need to be stricter? With that matrix, you can skip the confusing vocabulary and focus on how to improve practices.
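As a quick illustration, the false discovery rate falls straight out of the estimated significant column; the counts below are the estimates from the matrix above, not measured values.

```python
# Share of significant results that are probably not genuine improvements,
# using the estimated counts from the confusion matrix above.
false_positives = 7
true_positives = 23

false_discovery_rate = false_positives / (false_positives + true_positives)
print(f"Estimated false discovery rate: {false_discovery_rate:.0%}")  # 23%
```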

Are Those Numbers Exact?

The computation above is far from exact. An experiment with no real effect has a 10% chance of coming out significant, and one with a real effect at the minimum detectable effect has about a 20% chance of being missed. Those rates don’t mean that, say, exactly 9 out of 90 true nulls come out significant, just that 9 (give or take one) is the most likely count; you might well see 7 or 12. The computation isn’t very reliable with fewer than a hundred tests, and really fewer than a thousand. But it gives a clear illustration of why we limit the false-positive rate.
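To make that variability concrete, here is a quick check under a simple binomial model (my assumption, not a calculation from the post’s spreadsheet): even the single most likely count of false positives is far from guaranteed.

```python
from math import comb

# Probability of seeing exactly k false positives among n experiments that
# truly change nothing, each with a 10% chance of a false positive.
def prob_exactly(k, n=90, fpr=0.10):
    return comb(n, k) * fpr**k * (1 - fpr)**(n - k)

for k in (7, 9, 12):
    print(f"P(exactly {k} false positives out of 90) = {prob_exactly(k):.1%}")
# 9 is the single most likely count, yet it happens well under half the time,
# so seeing 7 or 12 instead is entirely plausible.
```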

More importantly, a feature change might not be universally good or bad. It might be great for some users but not others; that balance might shift with time. Think of that estimation as a way to illustrate what the experiment policies (false-positive rate, power) are trying to control.

With Fewer Positive Results

Let’s assume that we see fewer experiments go through, say 13 out of 100, with the same false-positive rate (10%) and experiment power (80%).

Therefore, we likely had around (10% * 87 ≈) 9 false-positive results among the 13 deemed significant: that’s most of them, leaving only about 4 genuine improvements.

Having most of your significant results turn out to be wrong is alarming. That’s why we try to limit how many false positives we allow, and 10% is probably a bit high in this case.

With an 80% experiment power, to end up with those four true positives, there might have been (4/80% - 4 =) 1 additional idea that was effective but came out non-significant.
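Running the estimate_confusion_matrix sketch from earlier on this scenario (a hypothetical helper, with the same 10% false-positive rate and 80% power) reproduces these rough counts:

```python
# 13 significant results out of 100 experiments.
print(estimate_confusion_matrix(100, 13))
# {'true positives': 4, 'false positives': 9, 'false negatives': 1, 'true negatives': 86}
```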

The above estimates also mean a low false-negative rate, about 1 in 87 of the non-significant results. Therefore, most treatments whose test results appeared ineffective were indeed ineffective. However, you don’t have to forget all about them: they might be ideas based on sound principles whose execution doesn’t do them justice.

When presenting experimentation, you will mention that there are false negatives. That will lead to questions, especially when test results don’t go the way the team expected. Presenting overall results like this gives you arguments against releasing changes as they are, or more generally against relaxing the criteria: negative results are very likely actual negatives.

My recommendation, in this case, would be to rethink the ideas, leverage more analytics, and spend more time talking to users to understand what would improve their experiences. You might want to limit your treatment to relevant users. For example, say you tried offering localization, but it didn’t enhance conversion for all customers: maybe most don’t need it. Instead, try focusing the suggestions on users in non-English-speaking countries or whose browsers use a different default language.

Once that is done and more treatments lead to significant positive changes, you can think about a stricter false-positive rate. However, think first about improving your ideas and execution. Demanding more work before testing may slow you down at first, but hopefully, you will learn enough about your users to get better results. Being more demanding with a lower false-positive rate means that more experiments will fail the test; this will be painful at first. However, as it stands, too many bad ideas are going through with a false-positive rate as high as 10%. The test criteria that you are using are too lax, and you want a clear signal of what is working.

With an Outstanding Significance Rate

Let’s assume that we see most experiments pass the test and get rolled out, say 60 out of 100, with the same false-positive rate (10%) and experiment power (80%). This is very rare, so don’t feel embarrassed if it is nowhere near the experience in your organization.

Therefore, with a false-positive rate of 10% applied to the 40 non-significant tests, we likely had very few false-positive results: about (10% * 40 =) 4, leaving 56 genuine improvements among the 60 significant ones.

Having so many true positive results also means that, with 80% power, you might have had around (56/80% – 56 =) 14 actual good results that were non-significant.

14 / (14+26) = 35%. A third of the tests deemed not significant could be good ideas. That’s a lot. You probably want to leverage those promising ideas into meaningful changes. (56 + 14 =) 70 out of the 100 ideas that you tried could have improved your product. That’s an exceptionally high score!
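For completeness, the same estimate_confusion_matrix sketch applied to this scenario (assuming 60 significant tests out of 100, which matches the arithmetic above) also recovers that 35% figure:

```python
# 60 significant results out of 100 experiments.
matrix = estimate_confusion_matrix(100, 60)
print(matrix)
# {'true positives': 56, 'false positives': 4, 'false negatives': 14, 'true negatives': 26}

# Share of non-significant tests that were probably good ideas.
missed = matrix["false negatives"] / (matrix["false negatives"] + matrix["true negatives"])
print(f"{missed:.0%}")  # 35%
```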

Our recommendation would be to increase the required power to more than 80%: most of your changes are positive, and many are likely to have a dramatically positive effect, so running longer experiments lets you collect a more precise, more robust signal and avoid rejecting good ideas too early. Correspondingly, you might want a stricter false-positive rate, maybe down to 5%.

Insufficient Information

These tables are rough estimates: you can’t tell from a summary like this which individual test looked more convincing than it should have. They also don’t help much with less compelling results. Instead, they inform your decision to invest more time investigating ideas before testing changes, or to tighten your test criteria.

How Can I Implement This with Split?

Hopefully, the math that we discussed so far is clear now. If you want to run it on your own data, you can copy our Confusion matrix estimation spreadsheet sample.

Learn More About Experimentation

After you’ve run several dozen A/B tests, you should pause to look at how likely you are to have a successful outcome and compute the false discovery rate. That would allow you to decide whether to make your experimentation practices stricter. In addition, it’s a great piece of context to have when choosing to set the false positive rate away from the default 5%.

I hope you enjoyed reading this. You might also enjoy:

Follow us on YouTube, Twitter, and LinkedIn for more great content! You should also join the Split Community on Slack.
