The Power of A/B Testing

Introduction

A/B testing is a widely used technique for comparing the discrete outcomes of two versions of something and determining which performs better. It is used extensively to evaluate websites and apps, but it dates back to the beginning of the 20th century, and its statistical foundations were formalized by Fisher in 1922.

In its basic form, it requires formulating a hypothesis of the form 'version A outperforms version B by X% (or vice versa)' and exposing both versions to the same experimental conditions for a sufficiently long time. The 'sufficient exposure' part is paramount for the test to converge to stable statistics. Under fairly general assumptions, one can estimate from the hypothesis the minimum number of samples required, via the power of the test.
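As an illustration (this snippet is our addition, not part of the original analysis), the minimum sample size can be sketched with the classic normal-approximation formula for comparing two proportions; exact figures vary between formulas and calculators, and Fisher's exact test used later in this article is slightly more conservative than the normal approximation.

import numpy as np
from scipy.stats import norm

def min_samples_per_variant(p1, p2, alpha=0.05, power=0.75):
    '''Rough normal-approximation sample size per variant for detecting
    conversion rates p1 vs p2 with a two-sided test at level alpha and
    the given power (a sketch: other formulas give somewhat different values).'''
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * np.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return int(np.ceil(numerator / (p1 - p2) ** 2))

# e.g. detecting 4% vs 5% (the setup used in the experiments below)
# comes out at roughly 6,000 samples per variant
print(min_samples_per_variant(0.04, 0.05))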

Despite this long history, it is not uncommon to see under-powered tests producing spurious or plainly wrong results. It is a particular pity when those results drive changes to websites and campaigns, because as soon as the changes are adopted at scale the sample size grows, results regress to the mean, and the spurious effects disappear.

Therefore, in this short article we run a few simulated experiments in a controlled environment to show the effects of a properly designed test, of under-powered tests, and of premature interpretation of odds ratios and uplifts.

Goal

In this article we show the features and the limits of A/B testing under usual circumstances and when the experiment is stopped ahead of time or the design is under-powered. In both cases there is a non-negligible chance that the test will show false or misleading results, with that chance increasing dramatically when the test is under-powered.

Prerequisites

This article is suitable for readers with an understanding of statistics and experimental setup; in particular, a reasonable understanding of Fisher's exact test and the power of a test is helpful. It can also serve as a stimulus for less specialized readers to deepen their understanding of comparative statistics when designing A/B tests and evaluating their results.

Method and Results

We run multiple simulated A/B tests under controlled conditions where we are in control of the generating process. During the different phases of the experiment we vary the setup: we change the effect by modifying the ratios of the generating processes, and we change the number of samples per test to mimic under-powered tests. Each setup is run multiple times to show the effect of under-powering on false results.

import numpy as np
import scipy.stats as stats

# let's set up a little experiment
def experiment(p1, p2, n):
    '''Runs a single experiment on simulated data with the provided
    conversion probabilities and number of samples per variant.
    returns: odds ratio and p-value of Fisher's exact test'''
    # simulate the number of conversions observed for each variant
    obs1 = np.random.binomial(n, p1)
    obs2 = np.random.binomial(n, p2)
    # 2x2 contingency table: conversions vs non-conversions
    obs = np.array([[obs1, n - obs1], [obs2, n - obs2]])
    odds_ratio, p_value = stats.fisher_exact(obs)
    return odds_ratio, p_value

def bootstrap_experiment(p1, p2, n, repetitions):
    '''Runs the experiment multiple times for the given parameters.
    p1: conversion rate (CTR) for variant A
    p2: conversion rate (CTR) for variant B
    n: number of observations per variant per experiment
    repetitions: number of times we run the same experiment
    returns: fraction of runs with odds ratio favouring variant A,
             fraction of runs with p-value below 0.05'''
    data = np.array([experiment(p1, p2, n) for _ in range(repetitions)])
    return np.mean(data[:, 0] > 1), np.mean(data[:, 1] < 0.05)
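A quick usage note (our addition): seeding NumPy's random generator makes a batch of simulations reproducible, and the two returned values are the share of runs in which the odds ratio (misleadingly) favours variant A and the share in which Fisher's test reports a significant difference. Here we use the parameters of Experiment 1 below.

np.random.seed(1)  # optional: make this batch of simulations reproducible
frac_or_a, frac_detected = bootstrap_experiment(0.04, 0.05, 6238, 1000)
print(f'OR favours A in {frac_or_a:.1%} of runs, '
      f'significant difference detected in {frac_detected:.1%} of runs')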

 

Experiment 1 – Well-designed experiment

For this test we assume the conversion rate of variant A is 4% and that of variant B is 5%, i.e. a relative uplift of 25%. We keep standard error rates: a significance level alpha of 0.05 and a power of 0.75 (i.e. a false negative rate beta of 0.25). In this setup we need 6238 samples per variant.

frac_or_a, frac_detected = bootstrap_experiment(0.04, 0.05, 6238, 1000)
print('Uplift 25% - 6238 Samples')
print('Odds ratio favourable to variant A:', "{:.2f}".format(frac_or_a*100), '% (misleading result)')
print('Effect detected in :', "{:.2f}".format(100*frac_detected), '% of the cases (true positive ratio)')
Uplift 25% - 6238 Samples
Odds ratio favourable to variant A: 0.30 % (misleading result)
Effect detected in : 74.70 % of the cases (true positive ratio)

As we can see, a properly designed test detects the effect in roughly 75% of the cases, in line with the design power, and has a very small fraction of cases (0.3%) where a particular realization of the random process points to variant A as having a better conversion rate than variant B.

Experiment 2 – Under-powered design

For this test we again assume the conversion rate of variant A is 4% and that of variant B is 5%, a relative uplift of 25%. We keep the same significance level (alpha = 0.05) and target power (0.75), but we under-power the test by collecting only 623 samples per variant, a tenth of what the design calls for.

frac_or_a, frac_detected = bootstrap_experiment(0.04, 0.05, 623, 1000)
print('Uplift 25% - 623 Samples')
print('Odds ratio favourable to variant A:', "{:.2f}".format(frac_or_a*100), '% (misleading result)')
print('Effect detected in :', "{:.2f}".format(100*frac_detected), '% of the cases (true positive ratio)')
Uplift 25% - 623 Samples
Odds ratio favourable to variant A: 17.80 % (misleading result)
Effect detected in : 11.20 % of the cases (true positive ratio)

As we can observe, reducing the power of the test has two effects: 1) there is a puzzlingly high fraction of experiments (17.8%) in which variant A apparently outperforms variant B, purely due to unlucky realizations of chance; 2) a properly conducted Fisher test on an under-powered design detects only a small share of the true positives (11.2%).

 

Experiment 3 – A/A Test

For this test we set the conversion rate of both variants to 4%, so there is no effect at all, but we size the test for the 25% uplift of Experiment 1. We keep the same significance level (alpha = 0.05) and target power (0.75) and collect 6238 samples per variant, even though the actual uplift is null.

frac_or_a, frac_detected = bootstrap_experiment(0.04, 0.04, 6238, 1000)
print('Uplift 0% - 6238 Samples')
print('Odds ratio favourable to variant A:', "{:.2f}".format(frac_or_a*100), '% (spurious result)')
print('Effect detected in :', "{:.2f}".format(100*frac_detected), '% of the cases (false positive ratio)')
Uplift 0% - 6238 Samples
Odds ratio favourable to variant A: 49.90 % (spurious result)
Effect detected in : 5.50 % of the cases (false positive ratio)

As we can see, a properly designed test correctly finds no effect in about 95% of the cases and flags a significant difference in about 5% of the cases (5.5% here), exactly as expected for alpha = 0.05. However, if we look only at the odds ratio, we observe a spurious "uplift" for variant A in nearly 50% of the cases, with variant B "winning" in most of the remaining runs: judged on the odds ratio alone, one variant always looks better by pure chance.

Discussion

The experiments were carried out using values comparable to the conversion rates of a digital marketing campaign. An uplift of 25% is a quite generous, but not uncommon, improvement, typical of the introduction of a new creative in a well-exploited ad group. We can therefore assume the results are representative of setups commonly encountered in A/B experiments run to optimize a website or a marketing campaign. The use of a controlled environment and simulated data offers the advantage that we know the ground truth by design and can repeat the same experiment thousands of times to average out extreme realizations of the stochastic variables.

As one could expect,

  • Properly designed A/B tests deliver results that are in line with expectations, both when the effect is actually present and when there is no effect at all.
  • Under-powered A/B tests struggle to detect an effect even when it is really present.

Using the odds ratio or directly comparing the conversion rates, without a proper statistical setup, plays out differently in the three scenarios (a minimal sketch follows this list):

  • In a properly designed experiment we can get away with it, because proper statistical tools were used to size the test in the first place.
  • In an under-powered experiment a worryingly large share of runs (almost 18% here) gives false signals indicating the WRONG variant as the more remunerative one.
  • If there is no effect at all, we find a spurious effect in about 50% of the cases, purely through the chance realization of the stochastic variables.
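To make the last two points concrete, here is a minimal sketch (again our addition) that counts how often a plain comparison of observed conversion rates, with no statistical test at all, crowns the wrong or a non-existent winner. With equal sample sizes, the odds ratio exceeds 1 exactly when variant A's observed rate is higher, so this reproduces, up to simulation noise, the misleading odds-ratio figures of Experiments 2 and 3.

import numpy as np  # already imported above

def naive_winner_rate(p1, p2, n, repetitions):
    '''Fraction of simulated experiments in which variant A's observed
    conversion rate comes out strictly higher than variant B's,
    with no statistical test applied at all.'''
    conv_a = np.random.binomial(n, p1, size=repetitions)
    conv_b = np.random.binomial(n, p2, size=repetitions)
    return np.mean(conv_a > conv_b)

# under-powered setup: the worse variant "wins" in roughly one run out of five
print(naive_winner_rate(0.04, 0.05, 623, 1000))
# A/A setup: a spurious "winner" appears in roughly half of the runs
print(naive_winner_rate(0.04, 0.04, 6238, 1000))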

Conclusion

Running an under-powered A/B test, without any statistical test, can look amazing: it may show a serious increase in conversion rate where there is none. It also has a quite reasonable chance of indicating an uplift where there is actually a drop in conversions. Both conditions are extremely dangerous, because the spurious effects disappear as soon as the change is put into production, while a drop mistaken for an uplift degrades the performance of your website or campaign.

A/B testing is an extremely powerful tool: properly used, it can dramatically change the performance of your website or marketing campaign.

We can help you set up your A/B testing properly.

 

References

  • R. A. Fisher, “Studies in Crop Variation I: An examination of the yield of dressed grain from Broadbalk”, Journal of Agricultural Science (1921)
  • R. A. Fisher, “On the interpretation of χ2 from contingency tables, and the calculation of P”, Journal of the Royal Statistical Society, Vol. 85 (1922)
  • A. Gallo, “A Refresher on A/B Testing”, Harvard Business Review (2017)

Are you ready to take smarter decisions?

Otherwise you can always drop a comment…