Experimentation statistical methodology
Last edited: Nov 07, 2024
Experimentation is available to all customers on a Developer, Foundation, or Enterprise plan. If you're on an older Pro or Enterprise plan, Experimentation is available as an add-on. To learn more, read about our pricing. To subscribe to a different plan, contact Sales.
This section includes an explanation of advanced statistical concepts. We provide them for informational purposes, but you do not need to understand these concepts to use Experimentation.
Overview
This guide details the statistical methods LaunchDarkly uses in its Experimentation platform.
For a high-level overview of Bayesian statistics and why LaunchDarkly uses it in Experimentation, read Experimentation and Bayesian statistics.
Concepts
An experiment comprises two or more variations, one or more metrics, a randomization unit, and the units assigned to those variations in the experiment. This section defines the mathematical notation which we will use in the remainder of the document.
The Experimentation-related terms and their notations for the purpose of this document include:
- Variations: An experiment has $J$ variations indexed $j = 1, \ldots, J$. We will refer to the $j = 1$ variation as the control variation.
- Randomization units and units: The type of the experiment unit is called the randomization unit. Examples of randomization units include user, user-time, organization, and request. A unit is a specific instance of a randomization unit that you assign to a variation in the experiment. In LaunchDarkly, the randomization unit is a context kind, and a unit is a context key. You can read more about randomization units and context kinds at Randomization units. At the time of an analysis, there are $n_j$ units observed for variation $j$, which are indexed $i = 1, \ldots, n_j$.
- Metrics: An experiment can have one primary metric and several secondary metrics as described in the Metrics topic. The methods described below apply to both primary and secondary metrics. Let $Y_j$ be a length-$n_j$ vector representing the metric values for variation $j$, and $y_{ij}$ be the observed value of the metric for unit $i$ assigned to variation $j$ in the experiment.
In LaunchDarkly experiments, a metric's unit of analysis must be the same as the unit of randomization. This means that if your experiment has "user" as the unit of randomization then any metric must also be a user-level metric. Because units in an experiment can be associated with multiple events, all events for a user are aggregated into unit-level metrics as described in the section Average and sum metrics.
Objective of an experiment
Our methods are designed around the belief that the primary objective of an experiment is to make a decision between variations. The experiment results inform that decision by providing estimates of the causal effects on the metrics of interest for each variation.
LaunchDarkly's Experimentation platform uses Bayesian inference for the reasons described in the guide Experimentation and Bayesian statistics. In Bayesian statistics, the decision process is separated into inference and decision steps. Our first step is inference, where we combine our prior beliefs with the available data to estimate the unknown parameters we will use to make our decision. We will represent our beliefs about these parameters in the form of a probability distribution, called the posterior distribution. The second step is to make a decision. Because Bayesian estimates are probability distributions, the experimenter can interpret these estimates as probabilities and incorporate them into their decision process.
In LaunchDarkly experiments, the experimenter wants to learn the average value per unit of the metric conditional on the variation in order to make their decision. While we observe the average value in the experiment samples exposed to a variation, we do not know what the average value of that metric would be if a variation were applied to the entire target population. Let $\mu_j$ refer to the unknown mean value per unit of the metric of interest for variation $j$. Our statistical methods will estimate a posterior distribution for $\mu_j$ for each variation $j$.
We summarize the posterior distribution of $\mu_j$ with the following statistics:
- 90% credible interval: a lower and upper value that have a 90% probability of containing the true value of $\mu_j$
- Posterior mean: a point estimate of $\mu_j$
Because the primary purpose of an experiment is for you to decide which variation to launch, we estimate comparisons between variations:
- Probability to be best: For each variation $j$, the probability that $\mu_j$ is larger than the $\mu_k$ of all other variations $k \neq j$.
- Relative differences: For each non-control variation, we estimate a posterior distribution of the relative difference between its mean $\mu_j$ and the control mean.
How LaunchDarkly calculates the posterior distribution of $\mu_j$ depends on whether the metric is a numeric metric or a conversion metric. We discuss the estimation procedure for each metric type separately in the following sections.
Numeric metrics
Numeric metrics have numeric values associated with their events so they can take any numeric value. Examples of numeric metrics include page load time, efficacy of various search algorithms, and number of items in a shopping cart at checkout. Numeric metrics contrast with conversion metrics which only track whether or not an event occurred. You can read more about creating these metrics in Numeric metrics.
In our statistical methods, numeric metrics are treated as unbounded continuous random variables. With numeric metrics, the shape of the data-generating distribution for the unit-level metric values is unknown. However, because we are interested in estimating the population mean $\mu_j$, we can simplify our analysis by appealing to the Central Limit Theorem. Under some regularity conditions, as $n_j \to \infty$, the sample mean $\bar{y}_j$ is approximately normally distributed with location $\mu_j$ and scale $\sigma_j / \sqrt{n_j}$, where $n_j$ is the number of units in variation $j$ and $\sigma_j$ is the standard deviation of the unit-level metric values.
For numeric metrics, we use the following likelihood function for the sample mean of the observed data:
$$\bar{y}_j \mid \mu_j \sim N\left(\mu_j, \frac{\sigma_j^2}{n_j}\right)$$
For convenience and because $\sigma_j$ is not the primary goal of our inference, we treat $\sigma_j$ as known and equal to an estimate of the standard deviation calculated from the sample. Because we use an estimated value for $\sigma_j$ rather than estimating it in the model, our method is an empirical Bayesian method. This is the case with most of our statistical methods, as we are willing to trade some methodological purity for practicality.
To complete the model, we need to specify a prior distribution for $\mu_j$. For the control variation ($j = 1$), we use an improper non-informative prior, $p(\mu_1) \propto 1$. For the other variations, we use priors that shrink the results towards the control variation's mean. We generate this prior from the empirical distribution of relative differences between variations in all experiments on our platform using metrics of the same type (numeric or conversion) and aggregation function (average or sum).
The equation for this prior is:
$$\mu_j \sim N\left(\bar{y}_1,\ \epsilon^2 \bar{y}_1^2 + \frac{\sigma_1^2}{n_1}\right), \quad j > 1$$
where $\bar{y}_1$, $\sigma_1$, and $n_1$ are the control variation's sample mean, standard deviation, and number of units, and $\epsilon^2$ is the variance of the distribution of observed relative differences ($(\bar{y}_j - \bar{y}_1) / \bar{y}_1$) across all experiments with numeric metrics on the platform. The first term, $\epsilon^2 \bar{y}_1^2$, scales the expected relative difference by the observed control mean. The second term, $\sigma_1^2 / n_1$, accounts for the uncertainty in our estimate of the control mean. The value of $\epsilon$ is between 0.13 and 0.19, conditional on the type of the metric.
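Combining the likelihood and prior provides the posterior distribution of $\mu_j$, which represents our beliefs about $\mu_j$ after observing the data from the experiment.
Given the normal likelihood and prior, the posterior distribution is also a normal distribution with the following parameters:
$$\mu_j \mid \bar{y}_j \sim N\left(\hat{\mu}_j, \hat{\sigma}_j^2\right), \quad \hat{\sigma}_j^2 = \left(\frac{1}{\tau_j^2} + \frac{n_j}{\sigma_j^2}\right)^{-1}, \quad \hat{\mu}_j = \hat{\sigma}_j^2 \left(\frac{m_j}{\tau_j^2} + \frac{n_j \bar{y}_j}{\sigma_j^2}\right)$$
where $m_j$ and $\tau_j^2$ are the mean and variance of the prior for $\mu_j$, $\bar{y}_j$ is the sample mean, $\sigma_j$ is the sample standard deviation, and $n_j$ is the number of units. For the control variation ($j = 1$), whose prior is flat, this reduces to $\hat{\mu}_1 = \bar{y}_1$ and $\hat{\sigma}_1^2 = \sigma_1^2 / n_1$.
The experiment results page displays the posterior distribution of each variation's mean ($\mu_j$) in the probability charts.
We use the expected value of the posterior distribution as a point estimate for $\mu_j$:
$$\hat{\mu}_j = E\left[\mu_j \mid \bar{y}_j\right]$$
The experiment results table displays the value of $\hat{\mu}_j$ in the Posterior mean column. We use a 90% credible interval of the posterior mean to provide a range of plausible values for $\mu_j$. Because there are multiple valid methods to calculate credible intervals, we use the highest density interval (HDI), which is the shortest interval that contains 90% of the probability mass of the posterior distribution.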
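As an illustration of the normal model described above, here is a minimal sketch of the conjugate update and the 90% interval. The function names and the example numbers are hypothetical, and the code is not LaunchDarkly's implementation. It assumes the standard deviation is treated as known, and it uses the fact that the HDI of a normal distribution coincides with its central interval.

```python
from statistics import NormalDist


def normal_posterior(y_bar, sigma, n, prior_mean=None, prior_var=None):
    """Posterior mean and variance of a normal mean, with sigma treated as known.

    If no prior is supplied, a flat (improper) prior is assumed, so the
    posterior is N(y_bar, sigma^2 / n).
    """
    data_precision = n / sigma**2
    if prior_mean is None or prior_var is None:
        return y_bar, 1.0 / data_precision
    prior_precision = 1.0 / prior_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * y_bar)
    return post_mean, post_var


def normal_hdi(mean, var, mass=0.90):
    """Highest density interval of a normal distribution.

    A normal density is symmetric and unimodal, so the shortest interval
    holding `mass` of the probability is the central interval.
    """
    dist = NormalDist(mean, var**0.5)
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


# Hypothetical data: control with a flat prior, treatment with a normal prior.
control = normal_posterior(y_bar=10.0, sigma=2.0, n=100)
treatment = normal_posterior(y_bar=11.0, sigma=2.0, n=100,
                             prior_mean=10.0, prior_var=0.04)
print("control:", control, normal_hdi(*control))
print("treatment:", treatment, normal_hdi(*treatment))
```

In this example the treatment prior carries as much precision as the data, so the posterior mean lands halfway between the prior mean and the sample mean.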
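We estimate the relative difference in means between two variations. We define the relative difference in the means of variations $j$ and $k$ as a parameter $\delta_{jk} = (\mu_j - \mu_k) / \mu_k$. The relative difference in the means also has a posterior distribution. To derive the posterior distribution of $\delta_{jk}$, we apply the delta method to $\mu_j$ and $\mu_k$, which gives the normal approximation:
$$\delta_{jk} \mid \text{data} \sim N\left(\frac{\hat{\mu}_j}{\hat{\mu}_k} - 1,\ \frac{\hat{\sigma}_j^2}{\hat{\mu}_k^2} + \frac{\hat{\mu}_j^2 \hat{\sigma}_k^2}{\hat{\mu}_k^4}\right)$$
where $\hat{\mu}_j$ and $\hat{\sigma}_j^2$ are the posterior mean and variance of $\mu_j$, and the two posteriors are treated as independent.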
Díaz-Francés and Rubio (2013) show that the approximation we use for the ratio of means holds under reasonable assumptions; you can read more at "On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables".
As with the mean of the metric for a single variation, we use the 90% highest density interval for the 90% credible interval. The experiment results table displays the 90% credible interval of the relative difference in means between each variation and the control variation ($(\mu_j - \mu_1) / \mu_1$ for all $j > 1$) in the column Relative difference from control variation.
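The delta-method step can be sketched as follows. This is a first-order approximation under the assumption that the two posteriors are independent and approximately normal, and that the control mean is not close to zero. The function name and inputs are hypothetical, and the sketch is not LaunchDarkly's implementation.

```python
def relative_difference(mean_t, var_t, mean_c, var_c):
    """First-order delta-method approximation for (mu_t - mu_c) / mu_c.

    Inputs are the posterior means and variances of the treatment (t) and
    control (c) means. Returns the approximate mean and variance of the
    relative difference, assuming the two posteriors are independent.
    """
    if mean_c == 0:
        raise ValueError("control mean must be non-zero")
    rel_mean = mean_t / mean_c - 1.0
    # Var(X / Y) ~= Var(X) / E[Y]^2 + E[X]^2 * Var(Y) / E[Y]^4
    rel_var = var_t / mean_c**2 + (mean_t**2) * var_c / mean_c**4
    return rel_mean, rel_var


# Hypothetical posteriors: treatment N(11, 0.04), control N(10, 0.04).
rel_mean, rel_var = relative_difference(11.0, 0.04, 10.0, 0.04)
print(rel_mean, rel_var)
```

Because the result is treated as normal, a 90% interval for the relative difference follows from the mean and variance in the same way as for a single variation's mean.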
Conversion metrics
Conversion metrics in LaunchDarkly indicate whether or not an event occurred. You can read more about defining conversion metrics at Conversion metrics.
We use different models for conversion metrics depending on whether the metric events are aggregated by unit using the average or the sum. If conversion metric events are aggregated by unit using the sum function, then the metric is interpreted as the average number of conversions per unit. We use the methods described in the previous section to estimate the mean of the metric for each variation.
If conversion metric events are aggregated by unit using the average function, the metric is interpreted as the conversion rate, meaning the proportion of units that experienced an event. Using the per-unit average of metric events ignores the number of times a unit converted and results in a binary variable taking values of 0 or 1. Because these conversion metrics are binary, we can use a binomial distribution to model the conversion rate.
Suppose that $\bar{y}_j$ is the proportion of the $n_j$ units in variation $j$ that converted. Then a total of $n_j \bar{y}_j$ of the units converted, and $n_j (1 - \bar{y}_j)$ units did not convert.
To model the total number of conversions ($n_j \bar{y}_j$), we use a binomial distribution with proportion parameter $\mu_j$ and size $n_j$ as the likelihood function:
$$n_j \bar{y}_j \mid \mu_j \sim \text{Binomial}\left(n_j, \mu_j\right)$$
We denote the proportion parameter as $\mu_j$ to be consistent with the notation used in the section Numeric metrics.
We use a Beta distribution as the prior for $\mu_j$:
$$\mu_j \sim \text{Beta}\left(\alpha_j, \beta_j\right)$$
The values of the prior hyperparameters $\alpha_j$ and $\beta_j$ differ between the control ($j = 1$) and treatment variations ($j > 1$). For the control variation ($j = 1$), we use the uniform distribution with $\alpha_1 = 1$ and $\beta_1 = 1$. For the treatment variations ($j > 1$), we use a prior similar to the one used for numeric metrics. The prior for non-control variations is a Beta distribution with hyperparameters $\alpha_j$ and $\beta_j$ chosen such that its expected value and variance are:
The value of is the variance of the empirical distribution of relative differences of experiments using a binary metric, and is currently set to .
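The posterior distribution of $\mu_j$ is also a Beta distribution, where $\alpha_j$ and $\beta_j$ are the prior hyperparameters, $n_j$ is the number of units, and $\bar{y}_j$ is the observed proportion of units that converted:
$$\mu_j \mid \bar{y}_j \sim \text{Beta}\left(\alpha_j + n_j \bar{y}_j,\ \beta_j + n_j (1 - \bar{y}_j)\right)$$
The expected value of this distribution is our preferred point estimate of $\mu_j$:
$$\hat{\mu}_j = \frac{\alpha_j + n_j \bar{y}_j}{\alpha_j + \beta_j + n_j}$$
The experiment results table displays the value of $\hat{\mu}_j$ in the Posterior mean column.
As with numeric metrics, we use the highest density interval for the 90% credible interval of $\mu_j$.
The experiment results table displays the 90% credible interval of $\mu_j$ in the column Credible Interval: 90%.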
To calculate the relative difference in means between each variation and the control variation ($(\mu_j - \mu_1) / \mu_1$ for all $j > 1$), we use the same method as for numeric metrics after transforming the posterior distributions of the means to normal distributions by matching the expected values and variances.
The experiment results table displays the 90% credible interval of this relative difference for each non-control variation in the column Relative difference from control variation.
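The Beta-binomial update for conversion rates can be sketched as below. The function names, the example counts, and the sampling-based interval are illustrative assumptions, not LaunchDarkly's implementation. The HDI is approximated as the shortest window that covers 90% of sorted posterior draws, and `beta_mean_var` returns the moments one would match to a normal distribution.

```python
import random


def beta_posterior(conversions, n, alpha=1.0, beta=1.0):
    """Posterior Beta parameters after `conversions` successes out of `n` units.

    The default alpha = beta = 1 is the uniform prior.
    """
    return alpha + conversions, beta + (n - conversions)


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def sampled_hdi(a, b, mass=0.90, draws=20000, seed=0):
    """Approximate HDI: the shortest window covering `mass` of sorted draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    window = int(mass * draws)
    start = min(range(draws - window),
                key=lambda i: samples[i + window] - samples[i])
    return samples[start], samples[start + window]


# Hypothetical control variation: 120 of 1,000 units converted.
a, b = beta_posterior(conversions=120, n=1000)
post_mean, post_var = beta_mean_var(a, b)
lo, hi = sampled_hdi(a, b)
print(a, b, post_mean, post_var, (lo, hi))
```

A closed-form or numerically optimized HDI would be more precise. Sampling keeps the sketch dependency-free.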
Probability to be best
For both numeric and conversion metrics, we calculate the probability to be best for each variation.
The probability to be best is the probability that the mean value per unit of a variation is the largest of all the variations if the success direction is positive. If the success direction is negative, then the probability to be best is the probability that the mean value per unit of a variation is the smallest of all the variations. LaunchDarkly calculates the probability to be best for each variation by taking samples from the posterior distributions of the $\mu_j$'s. The proportion of samples in which a variation is the largest, or smallest if the success direction is negative, is the probability to be best for that variation.
In the case where there are only two variations ($j = 1$ and $j = 2$) and the success direction of the metric is positive, the probability to be best for variation $2$ is the probability that the difference in means $\mu_2 - \mu_1$ is greater than zero.
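The sampling procedure described above can be sketched as a small Monte Carlo routine. The function name, the number of draws, and the example posteriors are hypothetical, and the sketch assumes normal posteriors summarized by a mean and standard deviation.

```python
import random


def probability_to_be_best(posteriors, higher_is_better=True, draws=20000, seed=0):
    """Monte Carlo estimate of each variation's probability to be best.

    `posteriors` is a list of (mean, standard deviation) pairs, one normal
    posterior per variation. Each draw samples every posterior once and
    credits the variation with the largest (or smallest) sampled value.
    """
    rng = random.Random(seed)
    pick = max if higher_is_better else min
    wins = [0] * len(posteriors)
    for _ in range(draws):
        sample = [rng.gauss(mean, sd) for mean, sd in posteriors]
        wins[sample.index(pick(sample))] += 1
    return [w / draws for w in wins]


# Hypothetical posteriors for a control and one treatment variation.
posteriors = [(10.0, 0.2), (10.5, 0.2)]
probs_up = probability_to_be_best(posteriors, higher_is_better=True)
probs_down = probability_to_be_best(posteriors, higher_is_better=False)
print(probs_up, probs_down)
```

With two variations and a positive success direction, the treatment's estimate approximates the probability that the difference in means is greater than zero.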
Sample ratio mismatch
A sample ratio mismatch (SRM) is when the observed proportions of units receiving variations differ from the proportions chosen in the experiment design. An SRM often indicates an error in the experiment implementation and that the experiment results are not valid.
To detect sample ratio mismatches we use the sequential method described in these sources:
- Anytime-Valid Inference for Multinomial Count Data
- A Better Way to Test for Sample Ratio Mismatches (SRMs) and Validate Experiment Implementations
LaunchDarkly alerts you that a sample ratio mismatch has occurred when the posterior odds favoring a mismatch are greater than 99%.
For more about sample ratio mismatches in the product, read Understanding sample ratios.
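To give a feel for this kind of test, here is a simplified sketch of a Dirichlet-multinomial Bayes factor in the spirit of the sources listed above. It is not LaunchDarkly's implementation. The uniform Dirichlet prior, the even prior odds, the function name, and reading the 99% threshold as a posterior probability are all assumptions made for illustration.

```python
from math import exp, lgamma, log


def srm_posterior_probability(counts, expected, prior_odds=1.0, concentration=1.0):
    """Posterior probability that the allocation differs from `expected`.

    H0: units follow the designed proportions `expected` exactly.
    H1: the true proportions are unknown, with a Dirichlet prior whose
        parameters all equal `concentration` (an assumed prior).
    The multinomial coefficient cancels in the ratio, so it is omitted.
    """
    n = sum(counts)
    alphas = [concentration] * len(counts)
    log_m1 = (lgamma(sum(alphas)) - lgamma(sum(alphas) + n)
              + sum(lgamma(a + x) - lgamma(a) for a, x in zip(alphas, counts)))
    log_m0 = sum(x * log(p) for x, p in zip(counts, expected))
    log_odds = log(prior_odds) + log_m1 - log_m0
    # Convert log odds to a probability without overflowing exp().
    if log_odds > 0:
        return 1.0 / (1.0 + exp(-log_odds))
    odds = exp(log_odds)
    return odds / (1.0 + odds)


# Hypothetical 50/50 design with balanced and imbalanced observed counts.
balanced = srm_posterior_probability([500, 500], [0.5, 0.5])
imbalanced = srm_posterior_probability([700, 300], [0.5, 0.5])
print(balanced, imbalanced)
```

Because the quantity is a ratio of marginal likelihoods, it can be recomputed as units arrive, which is what makes this style of test suitable for continuous monitoring.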
Average and sum metrics
Because a unit in an experiment can have multiple metric events, but experiment metrics must have one value per unit, we aggregate all experiment metric events associated with a unit. Suppose unit $i$ has $K_i$ events associated with it during the experiment period, and $e_{ik}$ is the value of the $k$th metric event for unit $i$. LaunchDarkly calculates the metric value $y_i$ for unit $i$ as follows:
- Average: $y_i = \frac{1}{K_i} \sum_{k=1}^{K_i} e_{ik}$ if $K_i > 0$, else 0,
- Sum: $y_i = \sum_{k=1}^{K_i} e_{ik}$ if $K_i > 0$, else 0.
For both aggregation functions, LaunchDarkly treats units for which we do not receive metric events as having a value of zero.
For example, consider a metric named transaction_value that is defined as the value in dollars of transactions made by a user. If a particular user had transaction_value events during the experiment period with values 10, 20, and 30, then the value of the metric is 20 when the average aggregation function is used, and 60 when the sum aggregation function is used.
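The two aggregation functions can be sketched as follows. The function name and signature are hypothetical and only illustrate the rules described in this section, including the zero value for units with no events.

```python
def aggregate(event_values, how="average"):
    """Collapse one unit's metric event values into a single metric value.

    Units with no metric events are assigned a value of zero.
    """
    if not event_values:
        return 0.0
    total = float(sum(event_values))
    if how == "sum":
        return total
    if how == "average":
        return total / len(event_values)
    raise ValueError(f"unknown aggregation function: {how}")


# The transaction_value example: events with values 10, 20, and 30.
events = [10, 20, 30]
print(aggregate(events, "average"))  # 20.0
print(aggregate(events, "sum"))      # 60.0
print(aggregate([], "average"))      # 0.0
```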