Why do we use A/B tests?
At Speed Kit, we dream of a world where loading times don't exist, and we're working hard every day to make that dream a reality. We are already enabling our customers to improve their web vitals, and we prove it continuously by performing A/B tests!
A/B tests are robust statistical tools that ensure we meet our quality criteria. They allow us to provide insights that are:
- Objective
- Trustworthy
- Clear
In this article, we will explain the fundamentals of A/B testing and focus on how to conduct A/B tests with more than two groups.
Disclaimer: A/B testing might seem simple from a theoretical standpoint, but it can be less trivial in practice. While implementing Speed Kit on our customers' websites, we often run into tricky and complex challenges, especially in terms of tracking and data quality. Even though we are convinced that A/B tests are indispensable to prove the value of our product, the amount of work necessary to do them properly should not be neglected.
Let’s say that we want to prove that the average LCP (Largest Contentful Paint) of one of our customers has been impacted positively by Speed Kit. In that case, we would create an A/B test.
LCP is one of the three Core Web Vitals. It specifically measures the loading performance of a webpage by recording the time it takes for the largest visible content element (usually an image, video, or large block of text) to load and render within the viewport.
How to perform an A/B test?
Without going into too much detail, let's take a quick look at how A/B tests are structured. If you want more information, you can read this article.
Formulating a hypothesis
A/B tests are designed to invalidate a hypothesis, based on data and statistical methods.
The first step of an A/B test is to formulate a hypothesis. Here, our goal is to prove that Speed Kit had an impact on our customer’s LCP. So let us formulate the first hypothesis, our H1:
H1: Speed Kit has an impact on our customer’s LCP.
As we just explained, A/B tests can only be used to invalidate hypotheses. So instead, we will try to invalidate the hypothesis opposite to H1. We call this hypothesis H0 (or the null hypothesis):
H0: Speed Kit has no impact on our customer’s LCP.
Let’s now prove that H0 is false.
What data are we using?
In order to do all these computations, we use data that we collect from our customer’s website. We call it our RUM (Real User Monitoring) data.
It is a very precise and reliable source that has been tested with many customers and ensures better precision than most of our customers' tracking sources. We also verify that our data yields values similar to our customer’s tracking data, so that we can all agree on the numbers before we perform an A/B test.
Our RUM data ensures great precision and helps us deeply understand the root causes of potential web performance issues on our customer’s website, so that we can support them in the best way possible.
Assigning test groups
Each session on our customer’s website will be randomly assigned to one of the following groups:
- Speed Kit group: The sessions assigned to this group will benefit from the improvements brought by Speed Kit.
- Control group: The sessions from this group will behave normally, without any interaction from Speed Kit. They will not be accelerated.
It is essential to make sure that the groups are well balanced and that we do not have any bias in the way the groups are attributed.
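To illustrate how such a random yet stable assignment could work, here is a minimal sketch of a hash-based 50/50 split. The function name and session id are hypothetical, and this is not Speed Kit's actual implementation.

```python
# A minimal sketch of a deterministic 50/50 split based on a session id.
# The real Speed Kit assignment logic is not shown here.
import hashlib

def assign_group(session_id: str) -> str:
    """Map a session to 'speed_kit' or 'control' with a stable 50/50 split."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return "speed_kit" if int(digest, 16) % 2 == 0 else "control"

print(assign_group("session-1234"))  # the same session id always gets the same group
```

Because the split is based on a hash of the session id, each session keeps its group across page loads, while the overall distribution stays close to 50/50.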
After a few weeks, we will be able to compute some metrics that will provide some details about the value that Speed Kit has been delivering. This article will not focus on how to compute the different metrics generated by an A/B test.
How to analyze your A/B test results?
In order to analyze an A/B test result, we have to focus on 3 metrics. They will help us to figure out if our test is truly significant.
The metric lift
In our case, we are interested in knowing whether we improved the LCP (Largest Contentful Paint) of our customer. We compare the average LCP of the Speed Kit group and the control group to compute the uplift:
Uplift = ((x̄ Speed Kit - x̄ Control) / x̄ Control) * 100
Metric lift = X% → The Speed Kit group has seen its average LCP improve by X% compared to the control group, over the same period of time.
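As an illustration, here is a minimal sketch of the uplift formula above with hypothetical average LCP values. Note that a lower LCP is better, so an improvement in loading performance shows up as a negative value of this formula.

```python
def uplift_percent(mean_speed_kit: float, mean_control: float) -> float:
    """Relative difference of the Speed Kit group vs. the control group, in percent."""
    return (mean_speed_kit - mean_control) / mean_control * 100

# Hypothetical average LCP values in milliseconds:
print(uplift_percent(mean_speed_kit=1900, mean_control=2500))  # -24.0 -> a 24% improvement
```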
Error types - explained
Before we introduce the next metrics, let’s have a small statistics lesson about testing errors. While the terminology may seem complex, the concept itself is straightforward. There are two types of errors for a test:
- Type I errors (false positives): When a test presents a type I error, we conclude that H1 is true even though it is not. For example, a COVID test produces a type I error if it says that you have COVID even though you actually do not.
- Type II errors (false negatives): When a test presents a type II error, we conclude that H1 is false even though it is actually true. It would mean that we fail to show the effect of Speed Kit, even though there is one. For a COVID test, a type II error would be to say that someone is not infected even though they are.
P-value
The p-value is a metric that we get when we perform an A/B test. It is quite important because it tells you how reliable your test actually is: it indicates the chance of getting a type I error.
If you have a p-value of 3%, it means that for two identical groups, there would be a 3% chance of measuring a lift at least as large as the one we observed, purely due to randomness.
By convention, we want to keep the probability of a type I error under 5%. Therefore, we need the p-value to be smaller than 5%.
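As an illustration, here is a minimal sketch of how such a p-value could be obtained with Welch's t-test on two samples of LCP values. The data is simulated and hypothetical, and the test actually used in our pipeline may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
lcp_control = rng.normal(loc=2500, scale=800, size=5000)    # ms, simulated
lcp_speed_kit = rng.normal(loc=2400, scale=800, size=5000)  # ms, simulated

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(lcp_speed_kit, lcp_control, equal_var=False)
print(f"p-value: {p_value:.4f}")  # below 0.05 -> we reject H0 at the 5% level
```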
Power
The statistical power is the counterpart of the p-value for type II errors: it measures the probability of detecting a true effect, which is one minus the chance of getting a type II error.
It is often perceived as less critical than the p-value and we reach acceptable values quite fast as long as we have enough data points. It is nevertheless an important indicator that we want to monitor. By convention we want it to be above 80%.
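As an illustration, the power of a two-sample t-test can be computed as follows. The effect size and sample size are hypothetical.

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(
    effect_size=0.08,  # standardized mean difference (Cohen's d), hypothetical
    nobs1=5000,        # sessions in the Speed Kit group, hypothetical
    ratio=1.0,         # control group of the same size
    alpha=0.05,        # significance threshold of the test
)
print(f"power: {power:.1%}")  # by convention, we want this to be above 80%
```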
Why do we have to be careful with multiple comparison tests?
In most cases, as long as the p-value is lower than 5% and the power higher than 80%, we can conclude that the test is a success. However, it gets trickier when we are running complex experiments with multiple comparisons.
Case study: testing new features
Let’s imagine that we want to do a special test. We would like to see if Speed Kit is helping to decrease the LCP of our customer, but we have 2 different configurations that we want to try:
- The regular Speed Kit product
- A new version of Speed Kit with a new beta feature that we want to test with our customer
We will therefore have 2 parallel A/B tests:
- Test A: Control vs. Speed Kit
- Test B: Control vs. Speed Kit + beta
We can run the 2 A/B tests separately and get each metric (LCP lift, p-value, power) twice, as sketched below.
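A minimal sketch of these two parallel comparisons on simulated, hypothetical LCP data, reusing Welch's t-test from above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
lcp_control = rng.normal(2500, 800, size=5000)         # ms, simulated
lcp_speed_kit = rng.normal(2450, 800, size=5000)       # ms, simulated
lcp_speed_kit_beta = rng.normal(2440, 800, size=5000)  # ms, simulated

_, p_a = stats.ttest_ind(lcp_speed_kit, lcp_control, equal_var=False)       # Test A
_, p_b = stats.ttest_ind(lcp_speed_kit_beta, lcp_control, equal_var=False)  # Test B
print(f"Test A p-value: {p_a:.3f}, Test B p-value: {p_b:.3f}")
```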
The core problem
The issue here lies in our definition of a success:
If the regular Speed Kit group and/or the experimental Speed Kit group shows a significant uplift, then we would say that we were successful.
Our test is a sort of “super-test”, depending on the results of the 2 tests described before. Let’s imagine that we have these p-values:
- Test A: p-value = 4.2%
- Test B: p-value = 3.9%
Let’s call P1 the probability of having at least one type I error in our experiment. For small p-values, P1 is approximately the sum of the probabilities of having a type I error in each of the tests (for independent tests, the exact value is 1 − (1 − 4.2%) × (1 − 3.9%) ≈ 7.9%):
P1 ≈ P(type I error in test A) + P(type I error in test B)
P1 ≈ 4.2% + 3.9%
P1 ≈ 8.1%
As we can see, two separate tests can each have a p-value lower than 5%, and yet the probability of a type I error for the whole experiment gets close to 8%, which does not respect the same standard.
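A minimal sketch of this effect: the probability of at least one false positive across several independent tests, each run at the same significance level.

```python
def family_wise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one type I error across k independent tests."""
    return 1 - (1 - alpha) ** k

print(f"{family_wise_error_rate(0.05, 2):.1%}")   # ~9.8% for two tests, each at the 5% threshold
print(f"{family_wise_error_rate(0.025, 2):.1%}")  # ~4.9% if each test uses a 2.5% threshold (see below)
```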
How can we solve the multiple comparisons problem?
This problem is well known to statisticians and is described as the “multiple comparisons” problem. Thankfully, there are a couple of ways to correct for it. Today, we will focus on the Bonferroni correction, a method that reduces the likelihood of falsely identifying significant results when conducting multiple tests.
The Bonferroni correction is extremely simple and also very conservative, which means that we make no compromise on the robustness of our test.
Bonferroni correction
The Bonferroni correction is extremely simple to compute: divide the p-value threshold of each A/B test by the number of tests being run.
In our example, it means that we do not accept the validity of test A and B as long as their p-value is above 5%/2 = 2.5%.
The concept is very simple: by dividing the threshold of each test by the number of tests, we ensure that the sum of the individual thresholds (and therefore the probability of at least one type I error in the whole experiment) never exceeds 5%.
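A minimal sketch of applying the correction to the two illustrative p-values from our example, assuming the statsmodels package is available:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.042, 0.039]  # Test A and Test B from the example above
reject, p_adjusted, alpha_sidak, alpha_bonf = multipletests(
    p_values, alpha=0.05, method="bonferroni"
)

print(alpha_bonf)   # 0.025 -> each test must now pass the 2.5% threshold
print(p_adjusted)   # [0.084 0.078] -> Bonferroni-adjusted p-values, both above 5%
print(reject)       # [False False] -> neither test is significant after correction
```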
What is not covered by the Bonferroni method
It is important to clarify what is not covered by the Bonferroni correction, because it is often a source of confusion.
- Independent tests: If the two A/B tests are completely independent of each other, meaning they are testing different hypotheses or in different contexts where the outcomes do not influence each other, the need for Bonferroni correction is reduced. The primary concern for multiple comparisons is when tests are correlated and the results could influence each other.
- Different metrics: If the A/B tests are evaluating different metrics that do not interact or influence each other, the correction may not be necessary. For example, one test might be measuring user web performance metrics while the other measures bounce rates. Since these are different outcomes, the risk of cumulative Type I error might be less significant.
- Exploratory analysis: If the A/B tests are part of an exploratory analysis rather than confirmatory research, some researchers argue that strict corrections like Bonferroni may be too conservative. The focus in exploratory analysis is often on identifying potential trends rather than making definitive conclusions.
- Sequential testing: If A/B tests are conducted sequentially and decisions are made based on the outcome of the previous test, the context might not require Bonferroni correction. Each test is evaluated in isolation as part of an iterative process.
- Cost of type II errors: In some cases, the cost of missing a potentially true effect (Type II error) is considered more critical than a false positive (Type I error). In such scenarios, researchers might accept a slightly higher risk of Type I error without applying the Bonferroni correction to avoid missing out on significant findings.
Conclusion
A/B testing is our secret sauce to proving Speed Kit’s value, one hypothesis at a time. By setting up clear tests and analyzing metrics like LCP lift, p-value, and power, we ensure our product delivers real results. With each partnership, we confidently contribute to making the web faster!