
Two-sample t-test with Python


This is a step-by-step guide on how to implement a t-test for A/B testing in Python using the SciPy and NumPy libraries.


Check out this post for an introduction to A/B testing, test statistic, significance level, statistical power and p-values.
If you are already familiar with two-sample t-tests, feel free to jump to Section 3 where I explain how to implement such a test in Python.

1. Two-sample t-tests


In this post, we will use a two-tailed t-test, which is well suited for continuous data.

Two-tailed means that we test for a difference in either direction: the mean of one sample can be smaller or larger than the mean of the other. In a one-tailed test, we test for a difference in one specific direction only: the mean of the first sample is either smaller or larger than the mean of the second sample, and the direction is chosen in advance.

The t-test that I will describe here is the so-called unpaired t-test, which compares the means of two independent groups. If the groups are instead dependent, a paired t-test should be performed [1].
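To make the paired/unpaired distinction concrete, here is a minimal sketch on made-up data. It assumes a hypothetical setup in which the same 30 units are measured before and after a change, so the two arrays are dependent. The seed and all distribution parameters are arbitrary choices for illustration; ttest_rel is SciPy's paired t-test and ttest_ind its unpaired counterpart.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical paired data: the same 30 units measured before and after a change
before = rng.normal(0.20, 0.05, 30)
after = before + rng.normal(0.01, 0.01, 30)  # each "after" value depends on its "before" value

# Paired t-test: works on the per-unit differences (after - before)
paired = ttest_rel(after, before)

# Unpaired t-test: treats the two arrays as independent groups
unpaired = ttest_ind(after, before)

print(f"paired:   t = {paired.statistic:.2f}, p-value = {paired.pvalue:.4f}")
print(f"unpaired: t = {unpaired.statistic:.2f}, p-value = {unpaired.pvalue:.4f}")
```

In this sketch the unit-to-unit spread is much larger than the shift, so ignoring the pairing hides an effect that the paired test picks up. The rest of this post deals with independent groups only.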


If the two populations from which the samples are drawn have equal variances and the sample means are normally distributed, a Student's t-test should be used. Welch's t-test is designed for unequal population variances, but the assumption of normality remains.


The test consists of calculating a t-statistic, a critical t-statistic value and a p-value. If the absolute value of the observed t-statistic is below the critical t-statistic value, we fail to reject the null hypothesis; otherwise, we reject it. We can also calculate a p-value, which is the probability of obtaining a t-statistic at least as extreme as the observed one if the null hypothesis is true. In both cases, the t distribution is used, which depends on the degrees of freedom. The degrees of freedom represent the number of values in the final calculation of a statistic that are free to vary.


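The Student's t-test statistic is defined as follows [1]:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}} \tag{1}$$

Here $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $n_1$ and $n_2$ are the sample sizes, and $s_1^2$ and $s_2^2$ are the respective unbiased estimators of the population variances. In this case, the number of degrees of freedom is n1 + n2 - 2.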


If both samples have the same size n, the above equation simplifies to the following form [1]:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2 + s_2^2}{n}}}$$

In this case, the number of degrees of freedom is 2n - 2.
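As a quick numerical check of this simplification, the sketch below computes the Student's t-statistic twice on made-up data: once with the pooled-variance form (the standard textbook form for arbitrary n1 and n2) and once with the equal-sample-size form. The seed, the sample size and the distribution parameters are arbitrary assumptions chosen for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 20                          # same (illustrative) size for both samples
x1 = rng.normal(0.20, 0.05, n)
x2 = rng.normal(0.23, 0.05, n)

mean1, mean2 = x1.mean(), x2.mean()
var1, var2 = x1.var(ddof=1), x2.var(ddof=1)  # unbiased variance estimators

# General form: pooled standard deviation, written for n1 = n2 = n
pooled_std = math.sqrt(((n - 1) * var1 + (n - 1) * var2) / (n + n - 2))
t_general = (mean1 - mean2) / (pooled_std * math.sqrt(1 / n + 1 / n))

# Simplified form, valid only when both samples have the same size
t_equal_n = (mean1 - mean2) / math.sqrt((var1 + var2) / n)

print(f"general form:    t = {t_general:.6f}")
print(f"simplified form: t = {t_equal_n:.6f}")
```

Both lines should print the same value, up to floating-point rounding.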


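Welch's t-test statistic is defined as follows [1]:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

In this case, the degrees of freedom (df) are calculated in the following way:

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}$$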
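Here is a minimal sketch of Welch's t-test computed by hand and compared with SciPy's ttest_ind(..., equal_var=False). It uses the standard Welch-Satterthwaite expression for the degrees of freedom. The data are made up: the seed, the sample sizes (25 and 40) and the standard deviations (0.03 and 0.08) are assumptions chosen only so that the two groups clearly have unequal variances.

```python
import math
import numpy as np
from scipy.stats import t, ttest_ind

rng = np.random.default_rng(2)  # arbitrary seed
x1 = rng.normal(0.20, 0.03, 25)  # smaller group, smaller spread
x2 = rng.normal(0.23, 0.08, 40)  # larger group, larger spread

n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)  # unbiased variance estimators

# Squared standard error contributed by each sample
se2_1, se2_2 = v1 / n1, v2 / n2

# Welch's t-statistic and Welch-Satterthwaite degrees of freedom
welch_t = (x1.mean() - x2.mean()) / math.sqrt(se2_1 + se2_2)
welch_df = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n1 - 1) + se2_2 ** 2 / (n2 - 1))

# Two-tailed p-value from the t distribution (sf = 1 - cdf)
welch_p = 2 * t.sf(abs(welch_t), welch_df)

# Reference values from SciPy
reference = ttest_ind(x1, x2, equal_var=False)

print(f"by hand: t = {welch_t:.4f}, df = {welch_df:.2f}, p-value = {welch_p:.6f}")
print(f"SciPy:   t = {reference.statistic:.4f}, p-value = {reference.pvalue:.6f}")
```

Note that the degrees of freedom are in general not an integer here, unlike in the Student's t-test.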

Before continuing let's briefly introduce some concepts that will be needed to understand my implementation of a t-test in Python.


Normal distribution:

Normal (also called Gaussian) distributions are symmetric around their mean. They are bell-shaped and are characterized by a mean (mu) and a standard deviation (sigma).


Cumulative distribution function:

The cumulative distribution function (CDF) of a random variable X evaluated at x is the probability that X will take a value less than or equal to x.
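As a small illustration, the sketch below evaluates the CDF of a standard Normal distribution (mean 0, standard deviation 1) with SciPy, and then inverts it with the percent point function (ppf), which is the inverse of the CDF. The value 1.96 is just a convenient example.

```python
from scipy.stats import norm

# P(X <= 1.96) for a standard Normal random variable X
p = norm.cdf(1.96)

# The percent point function maps a probability back to the corresponding x
x = norm.ppf(p)

print(f"CDF(1.96) = {p:.4f}")
print(f"ppf(CDF(1.96)) = {x:.2f}")
```

About 97.5% of the probability lies below 1.96, which is why this value shows up in two-tailed tests with alpha = 0.05.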


Z-scores:

A z-score measures the distance of a value from the mean in units of the standard deviation.
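In code, this is a one-line calculation. The sketch below uses a hypothetical helper, z_score, and illustrative numbers (a value of 0.23 drawn from a distribution with mean 0.20 and standard deviation 0.05). It also shows how a z-score connects to the CDF of the standard Normal distribution.

```python
from scipy.stats import norm

def z_score(x, mu, sigma):
    """Distance of x from the mean mu, in units of the standard deviation sigma."""
    return (x - mu) / sigma

z = z_score(0.23, mu=0.20, sigma=0.05)  # 0.03 above the mean, i.e. about 0.6 sigma
prob_below = norm.cdf(z)                # fraction of a Normal distribution below this value

print(f"z-score = {z:.2f}")
print(f"probability of a smaller value = {prob_below:.4f}")
```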


2. Introducing the case example: daily conversion rates


Suppose we have a hotel booking website and wish to study whether a given change in our website can boost our average daily conversion rate (at the final stage of the booking process). We then decide to run an A/B test to help us determine if we want to release such a change. For this example, let's set the significance level (alpha) to 0.05 and the statistical power (1-beta) to 0.8 (the statistical power will be used to define the minimum sample size in Section 3.1).


In this example, our null hypothesis states that there is no difference between the mean daily conversion rates with and without such a change in the website.


3. Implementing a t-test in Python


Let's start by importing all the libraries and functions that we will need:

from scipy.stats import norm, t, ttest_ind
from scipy.special import stdtr
import numpy as np
import math

3.1. Estimating the minimum sample size


Before running an A/B test, we need to estimate the minimum sample size required to observe a difference at least as large as our desired minimum detectable effect (MDE) with the chosen significance level and statistical power. If the sample size of our data is below this minimum, then even if we see a difference larger than the minimum detectable effect, we might not be able to reject the null hypothesis, since the difference would not be statistically significant. For the example defined above, the minimum sample size corresponds to the number of days the experiment would need to run.
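The effect of the sample size can be illustrated with a small simulation. The sketch below is not part of the original analysis: it uses a hypothetical helper, estimate_power, which simulates many experiments with a known true difference and counts how often the t-test rejects the null hypothesis. The baseline rate (0.20), true difference (0.03), standard deviation (0.05), the list of sample sizes and the seed are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

def estimate_power(sample_size, true_diff=0.03, std_dev=0.05, alpha=0.05,
                   n_experiments=2000, seed=0):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = np.random.default_rng(seed)
    # One row per simulated experiment, one column per day
    group_A = rng.normal(0.20, std_dev, (n_experiments, sample_size))
    group_B = rng.normal(0.20 + true_diff, std_dev, (n_experiments, sample_size))
    # Run one two-sample t-test per row
    pvalues = ttest_ind(group_A, group_B, axis=1).pvalue
    return np.mean(pvalues < alpha)

for n_days in (10, 20, 40, 80):
    print(f"{n_days} days per group: estimated power = {estimate_power(n_days):.2f}")
```

With few days per group, a real difference is missed in most simulated experiments; the rejection rate grows as more days are added.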


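If sigma is the standard deviation (assuming both groups have the same standard deviation, which is taken from historical data), then the minimum sample size per group (n) can be calculated with the following equation [2]:

$$n = \frac{2 \sigma^2 \left(Z_{\beta} + Z_{\alpha/2}\right)^2}{\mathrm{MDE}^2} \tag{2}$$

Here $Z_{\beta}$ is the z-score at a cumulative probability equal to the statistical power (1 - beta), and $Z_{\alpha/2}$ is the z-score at a cumulative probability of 1 - alpha/2.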


Note: the above Z-score values are calculated using a Normal distribution with mean zero and standard deviation equal to 1.


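If you wish instead to perform a one-tailed t-test, you need to make the following change to Equation 2, i.e. replace the z-score evaluated at 1 - alpha/2 with the z-score evaluated at 1 - alpha:

$$Z_{\alpha/2} \rightarrow Z_{\alpha}$$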

Here is an example of how we can implement Equation 2 in Python:

def get_min_sample_size(
        std_dev,  # standard deviation
        mde,  # minimum detectable effect
        alpha = 0.05,  # significance level
        power = 0.8  # statistical power
    ):
    """
    Estimate minimum sample size for t-test
    Assumptions:
        Sample sizes will be the same for both groups
        Both groups have the same standard deviation
    """
    
    # Find Z_beta from desired power
    Z_beta = norm.ppf(power)

    # Find Z_alpha
    Z_alpha = norm.ppf(1 - alpha / 2)

    # Return minimum sample size
    return math.ceil(2 * std_dev**2 * (Z_beta + Z_alpha)**2 / mde**2)

Let's calculate the minimum sample size using the function defined above for sigma = 0.05, and using the values chosen for our example for alpha (0.05) and power (0.8), and let's set the minimum detectable effect to 0.03:

min_sample_size = get_min_sample_size(
    std_dev = 0.05,
    mde = 0.03,
    alpha = 0.05,
    power = 0.8
)

The above will give us 44, which is the number of days we need to run the experiment.
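For comparison, here is a sketch of the one-tailed variant mentioned earlier. The function name get_min_sample_size_one_tailed is hypothetical, and the only change with respect to the two-tailed version is the assumption that the z-score for alpha is evaluated at 1 - alpha instead of 1 - alpha / 2 (the standard adjustment for a one-tailed test).

```python
import math
from scipy.stats import norm

def get_min_sample_size_one_tailed(std_dev, mde, alpha=0.05, power=0.8):
    """
    Estimate minimum sample size per group for a one-tailed t-test
    Same assumptions as the two-tailed version:
        equal sample sizes and equal standard deviations in both groups
    """
    # Z-score for the desired power
    Z_beta = norm.ppf(power)

    # One-tailed: all of alpha sits in a single tail
    Z_alpha = norm.ppf(1 - alpha)

    return math.ceil(2 * std_dev**2 * (Z_beta + Z_alpha)**2 / mde**2)

one_tailed_size = get_min_sample_size_one_tailed(std_dev=0.05, mde=0.03)
print(f"minimum sample size (one-tailed) = {one_tailed_size}")
```

With the same inputs, this gives 35 days instead of 44: a one-tailed test needs fewer observations, at the price of only being able to detect a difference in the direction chosen in advance.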


3.2. Generating simulated data


Let's create a function to generate data for group A and B, each with the same sample size. First, we will set a seed to get reproducible results (so we all get the same results). Then, we will generate data using normal distributions, both having the same standard deviation, such that we can use the Student's t-test.


Here is the function that does all the above:

def generate_data(
        sample_size,
        avg_daily_conversion_rate_A,  # avg daily conversion rate for group A
        avg_daily_conversion_rate_B,  # avg daily conversion rate for group B
        std_dev = 0.05  # standard deviation
    ):
    """Generate fake data to perform a two-sample t-test"""

    # Set a random seed for reproducibility
    np.random.seed(42)

    # Generate data for group A and B
    group_A = np.random.normal(avg_daily_conversion_rate_A, std_dev, sample_size)
    group_B = np.random.normal(avg_daily_conversion_rate_B, std_dev, sample_size)

    return group_A, group_B

Let's now use the above function to generate data incompatible with the null hypothesis, i.e. let's generate two samples, each with a different average daily conversion rate. I will choose a large-enough difference that will allow us to see it in our t-test (i.e. a difference at least as large as the minimum detectable effect).

group_A, group_B = generate_data(
    sample_size = min_sample_size,
    avg_daily_conversion_rate_A = 0.2,
    avg_daily_conversion_rate_B = 0.23,
    std_dev = 0.05  # same standard deviation used to estimate the sample size
)

3.3. Running a t-test


3.3.1. Rejecting the null hypothesis


We will use the ttest_ind() function from SciPy to retrieve the t-statistic and the p-value:

alpha = 0.05  # significance level chosen in Section 2
result = ttest_ind(group_A, group_B)
pvalue = result.pvalue
if pvalue < alpha:
    print(f"Decision: There is a significant difference between the groups (p-value = {pvalue}).")
else:
    print(f"Decision: There is no significant difference between the groups (p-value = {pvalue}).")
tstat = result.statistic
print(f't-statistic = {round(tstat, 2)}')

The above will return:

Decision: There is a significant difference between the groups (p-value = 0.00023115984392950252).
t-statistic = -3.84

Note:

  • The above assumes equal variances. Set the equal_var argument to False to perform a Welch's t-test which doesn't assume equal variances.

  • The above works for a two-tailed t-test. If you wish to perform a one-tailed t-test, set the alternative argument to 'less' (the mean of the first sample is smaller than the mean of the second sample) or 'greater' (the mean of the first sample is larger than the mean of the second sample) as appropriate.
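The two options from the note can be sketched as follows. This is only an illustration on freshly generated data (arbitrary seed, with the same means, standard deviation and sample size as in our example), so the numbers will differ from the ones shown above. Keep in mind that the alternative argument is only available in recent SciPy versions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)  # arbitrary seed, not the one used in generate_data
group_A = rng.normal(0.20, 0.05, 44)
group_B = rng.normal(0.23, 0.05, 44)

# Default: two-tailed Student's t-test (equal variances assumed)
student = ttest_ind(group_A, group_B)

# Welch's t-test: no equal-variance assumption
welch = ttest_ind(group_A, group_B, equal_var=False)

# One-tailed test: is the mean of group A smaller than the mean of group B?
one_tailed = ttest_ind(group_A, group_B, alternative='less')

print(f"Student:    t = {student.statistic:.2f}, p-value = {student.pvalue:.6f}")
print(f"Welch:      t = {welch.statistic:.2f}, p-value = {welch.pvalue:.6f}")
print(f"one-tailed: t = {one_tailed.statistic:.2f}, p-value = {one_tailed.pvalue:.6f}")
```

When both groups have the same size, the Student's and Welch's t-statistics coincide and only the degrees of freedom (and therefore the p-values) can differ. When the observed difference goes in the tested direction, the one-tailed p-value is half of the two-tailed one.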


Since the p-value (about 0.0002) is well below the cutoff of 0.05, we can reject the null hypothesis. This means there is a statistically significant difference between the two groups.


If this were a real-life experiment, this result would suggest that it might be a good idea to roll out the A -> B change to all users (if the difference is in the positive direction). That said, we might want to further support this change with additional studies, for example by running the experiment a second time.


Let's convince ourselves that we are doing things correctly and calculate the t-statistic by hand following Equation 1. Since both samples have the same size, the simplified equal-sample-size form applies. Here is how we can implement it in Python:

avgA = np.mean(group_A)  # sample means
avgB = np.mean(group_B)
varA = np.var(group_A, ddof = 1)  # unbiased variance estimators (ddof = 1)
varB = np.var(group_B, ddof = 1)
n = min_sample_size  # both groups have the same size
my_tstat = (avgA - avgB) / math.sqrt((varA + varB) / n)
print(f't-statistic calculated by hand = {round(my_tstat, 2)}')

The above code gives the following:

t-statistic calculated by hand = -3.84

which agrees with the t-statistic (tstat) retrieved with the ttest_ind() function.


Let's now calculate the critical t-statistic value using the percent point function, which is the inverse of the cumulative distribution function. With this, we obtain the value of the t distribution (for the given degrees of freedom) corresponding to the chosen value of alpha. Since the test is two-tailed, alpha is split equally between the two tails: the critical value is such that 1 minus the cumulative distribution function evaluated at the critical value is equal to alpha / 2 (0.025). If the absolute value of our observed t-statistic is below this critical t-statistic value, we fail to reject the null hypothesis; otherwise, we reject it.


Here is how we can calculate the critical value in Python:

df = 2 * n - 2  # degrees of freedom
critical_t_stat = round(t.ppf(1 - alpha / 2, df), 2)  # ppf = percent point function (inverse of the cumulative distribution function)
print(f'critical t-statistic = {critical_t_stat}')

Which prints the following:

critical t-statistic = 1.99

Since the absolute value of the observed t-statistic (3.84) is larger than the critical t-statistic (1.99), we can reject the null hypothesis.


Note: If you wish to perform a one-tailed t-test, critical_t_stat should be calculated in the following way instead:

critical_t_stat = round(t.ppf(1 - alpha, df), 2)  # ppf = percent point function (inverse of the cumulative distribution function)

Furthermore, we can also calculate the p-value by hand and validate the value we obtained before. For a two-tailed test, the p-value is calculated in the following way:


if t-statistic <= 0: p-value = 2 × (area under the t distribution to the left of the t-statistic)

if t-statistic > 0: p-value = 2 × (area under the t distribution to the right of the t-statistic)


We can implement the above in the following way:

my_pvalue = 2 * stdtr(df, -np.abs(my_tstat))  # stdtr = Student t distribution cumulative distribution function
print(f'p-value calculated by hand = {my_pvalue}')

Which prints the following (which agrees with the value obtained above):

p-value calculated by hand = 0.00023115984392950252
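An equivalent way to write this calculation, shown here only as an alternative sketch, is to use the survival function (sf = 1 - CDF) of the t distribution from scipy.stats. The helper name two_tailed_pvalue is hypothetical, and the inputs below are the rounded t-statistic (-3.84) and the degrees of freedom (86) from our example, so the result is close to, but not exactly, the value printed above.

```python
from scipy.special import stdtr
from scipy.stats import t

def two_tailed_pvalue(t_stat, df):
    """Two-tailed p-value: twice the area in the tail beyond |t_stat|."""
    return 2 * t.sf(abs(t_stat), df)

# Rounded t-statistic and degrees of freedom (2 * 44 - 2) from the example
pvalue_sf = two_tailed_pvalue(-3.84, 86)

# Same quantity via the CDF evaluated at -|t|, as done above with stdtr
pvalue_cdf = 2 * stdtr(86, -abs(-3.84))

print(f"p-value via t.sf  = {pvalue_sf}")
print(f"p-value via stdtr = {pvalue_cdf}")
```

Because the t distribution is symmetric around zero, the area to the right of |t| equals the area to the left of -|t|, so both expressions give the same p-value regardless of the sign of the t-statistic.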

3.3.2. Failing to reject the null hypothesis


Let's generate new data that presents a difference below the minimum detectable effect and re-run the t-test:

group_A, group_B = generate_data(
    sample_size = min_sample_size,
    avg_daily_conversion_rate_A = 0.2,
    avg_daily_conversion_rate_B = 0.201,
    std_dev = 0.05  # same standard deviation as before
)

Let's now run the t-test:

result = ttest_ind(group_A, group_B)
pvalue = result.pvalue
if pvalue < alpha:
    print(f"Decision: There is a significant difference between the groups (p-value = {pvalue}).")
else:
    print(f"Decision: There is no significant difference between the groups (p-value = {pvalue}).")

This is the result:

Decision: There is no significant difference between the groups (p-value = 0.33914179411708234).

Since the obtained p-value is larger than 0.05 (i.e. our choice for alpha), we fail to reject the null hypothesis.


3.3.3. Repository with full code


The full code can be found in the following repository: https://github.com/jbossios/two-sample-t-test-in-python


Do you wish to learn how to implement a chi-square test in Python? Check out this post.
Do you wish to learn all the technical skills needed to perform a data analysis in Python? Check out my free Python course for data analysis: https://github.com/jbossios/python-tutorial

References

[1] Fundamentals of Biostatistics (Seventh Edition) by Bernard Rosner
