
Introduction to A/B testing, test statistic and p-values

Updated: Oct 31, 2023


In this post, I will provide a brief introduction to the following concepts:

  • A/B testing

  • Significance level

  • Statistical power

  • p-value

  • Test statistic

1. A/B testing


A/B testing is a statistical method that enables us to reject (or fail to reject) a null hypothesis (i.e. an assumption) about a metric, such as a mean or a proportion. An A/B test is a randomized experiment in a controlled environment. A/B testing consists of calculating a test statistic, which measures how far the data deviates from the null hypothesis.


There are two types of errors to keep in mind while doing A/B tests:

  • The type I error (also known as a false positive) is rejecting the null hypothesis when the null hypothesis is actually true. The probability of a type I error is denoted alpha and is also called the significance level. For example, our test result says that there is a significant difference in a given metric between two groups, when in fact there is no real difference between them.

  • The type II error (also known as a false negative) is failing to reject the null hypothesis when the null hypothesis is actually false. The probability of a type II error is denoted beta. For example, there is a real difference in a given metric between two groups, but the test result says there is no significant difference between them. The so-called statistical power is 1 - beta.


How much we care about type I and type II errors depends on the application, so we need to carefully choose values for alpha and beta for every test that we perform.
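In practice, the chosen alpha and beta (via the power, 1 - beta) determine how many samples each group needs. As a sketch of this relationship, the snippet below uses the standard normal-approximation formula for comparing two proportions; the conversion rates (10% vs. 12%) are hypothetical numbers for illustration.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Approximate minimum sample size per group for detecting a
    difference between two proportions with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical scenario: detect a lift from a 10% to a 12% conversion rate
n = min_sample_size(0.10, 0.12)
print(f"Minimum sample size per group: {n}")
```

Note how the required sample size grows as the difference we want to detect shrinks: small effects need large samples.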



Things to consider when doing A/B tests:

  • Ensure random sampling: Randomly selecting the sample from the population is called random sampling. It is a technique where each member of the population has an equal chance of being chosen. Random sampling is important in A/B testing because we want the result to generalize to the entire population rather than reflect only the sample itself.

  • Ensure the sample size is large enough: It is important to determine the minimum required sample size before conducting the test, so we can avoid the bias that comes from samples that are too small (i.e. what is known as undercoverage bias).

  • Test incremental changes only: It is tempting to test several changes simultaneously, but it is then difficult to pinpoint which change influenced the test result. Moreover, different changes might affect a given metric in opposite directions.
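The random-assignment step above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical user IDs; a fixed seed is used only to make the example reproducible.

```python
import random

def random_split(user_ids, seed=42):
    """Randomly assign users to groups A and B so that every user
    has an equal chance of landing in either group."""
    rng = random.Random(seed)    # seeded only for reproducibility
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

users = [f"user_{i}" for i in range(1000)]  # hypothetical user IDs
group_a, group_b = random_split(users)
print(len(group_a), len(group_b))
```

Because assignment is random, any pre-existing differences between users are spread evenly across the two groups on average.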


To perform an A/B test, the following steps need to be followed:


  1. Define a null hypothesis and an alternative hypothesis. The null hypothesis is the status quo assumption that you want to challenge (for example, the conversion rates are the same in both groups), while the alternative hypothesis is its opposite (i.e. they are different).

  2. Choose a significance level. The significance level is typically (but not always) set to 0.05 (i.e. the probability of a type I error is 5%).

  3. Collect the data, calculate a test statistic and calculate a p-value. A p-value is the probability of obtaining data at least as extreme as the observed data, given that the null hypothesis is true (more about this below).

  4. Compare the p-value to the chosen significance level. If the p-value is less than or equal to the significance level, we can reject the null hypothesis; otherwise, we fail to reject it.


2. What is a p-value?


p-value = prob(data at least as extreme as observed | null hypothesis is true)


A p-value is the probability of obtaining data at least as extreme as the observed data, given that the null hypothesis is true.

What do we use a p-value for?


If the p-value is less than or equal to the significance level, we can reject the null hypothesis; otherwise, we fail to reject it.

What is not a p-value?


  • The p-value doesn't tell us the probability that the null hypothesis is true.

  • A/B testing doesn't allow us to accept the null hypothesis or provide evidence that it is true; we can only either reject it or fail to reject it.


3. How is a test statistic used?


As mentioned before, a test statistic measures how far the data deviates from the null hypothesis. There are different test statistics, each serving a different purpose. When using a test statistic, a rejection region is determined: if the test statistic falls within this region, we reject the null hypothesis; otherwise, we fail to reject it. This is equivalent to comparing the p-value to the significance level, although the p-value additionally tells you the probability of obtaining data at least as extreme as the observed data if the null hypothesis is true.
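The equivalence between the two decision rules can be checked numerically. This is a small sketch assuming a two-sided z-test, where the rejection region is everything beyond the critical value z = ±1.96 (for alpha = 0.05).

```python
from statistics import NormalDist

def decisions(z, alpha=0.05):
    """Return the decision by rejection region and by p-value
    for a two-sided z-test; the two should always agree."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    in_rejection_region = abs(z) >= z_crit
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return in_rejection_region, p_value <= alpha

# The two criteria give the same answer for any test statistic:
for z in (0.5, 1.95, 2.5, -3.0):
    by_region, by_p = decisions(z)
    assert by_region == by_p
```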

Do you wish to learn about chi-squared tests and how to implement them in Python? Check out this post: https://www.jonathanbossio.com/post/two-sample-chi-square-test-with-python
