Experimentation statistical methodology for frequentist experiments
Read time: 8 minutes
Last edited: Dec 14, 2024
This section includes an explanation of advanced statistical concepts. We provide them for informational purposes, but you do not need to understand these concepts to use Experimentation.
Overview
This guide explains the statistical methods LaunchDarkly applies to frequentist experiments in its Experimentation platform.
For a high-level overview of frequentist and Bayesian statistics, read Bayesian versus frequentist statistics.
Concepts
An experiment comprises two or more variations, one or more metrics, a randomization unit, and the units assigned to those variations in the experiment. This section defines the mathematical notation which we will use in the remainder of the document.
The Experimentation-related terms and their notation, as used in this document, include:
- Variations: An experiment has variations indexed $i = 0, 1, \ldots, I$. We will refer to the $i = 0$ variation as the control variation.
- Randomization units and units: The type of the experiment unit is called the randomization unit. Examples of randomization units include user, user-time, organization, and request. A unit is a specific instance of a randomization unit that you assign to a variation in the experiment. In LaunchDarkly, the randomization unit is a context kind, and a unit is a context key. At the time of an analysis, there are $n_i$ units observed for variation $i$, which are indexed $j = 1, \ldots, n_i$.
- Metrics: An experiment can have one primary metric and several secondary metrics. The methods described below apply to both primary and secondary metrics. Let $Y_i = (Y_{i1}, \ldots, Y_{i n_i})$ be the length-$n_i$ vector representing the metric values for variation $i$, where $Y_{ij}$ is the observed value of the metric for unit $j$ assigned to variation $i$ in the experiment.
In LaunchDarkly experiments, a metric's unit of analysis must be the same as the unit of randomization. This means that if your experiment has "user" as the unit of randomization, then any metric in the experiment must also have "user" as the unit of analysis.
Objective of an experiment
Our methods are designed around the belief that the primary objective of an experiment is to make a decision between variations. The experiment results inform that decision by providing estimates of the causal effects on the metrics of interest for each variation.
In LaunchDarkly experiments, the experimenter wants to learn the average value per unit of the metric conditional on the variation in order to make their decision. While we observe the average value in the experiment samples exposed to a variation, we do not know what the average value of that metric would be if a variation were applied to the entire target population.
Let $\mu_i$ and $\sigma_i^2$ refer to the unknown mean and variance per unit of the metric of interest for variation $i$. The observed average value in the experiment samples exposed to variation $i$ is denoted as:

$$\bar{Y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}$$
The estimated standard deviation is denoted as:

$$\hat{\sigma}_i = \sqrt{\frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_i\right)^2}$$
We summarize the key statistics for each variation $i$ as follows:
- Mean or Conversion rate: a point estimate of $\mu_i$ based on the type of metric. For a conversion metric with the unit aggregation method set to "average," such as a custom conversion binary metric, it is interpreted as the conversion rate. For other metrics, it is interpreted as the mean.
- 90% Confidence interval: a range defined by a lower and upper bound that is likely to contain the true value of $\mu_i$ with 90% confidence level. This means that if the population is sampled repeatedly and a 90% confidence interval is calculated for each sample, about 90% of those intervals will contain the true value of $\mu_i$.
Because the primary purpose of an experiment is for you to decide which variation to launch, we estimate comparisons between a non-control variation and the control variation:
- Relative difference from control: For each non-control variation $i$, we calculate a point estimate and a 90% confidence interval for the relative difference from control, expressed as $\frac{\mu_i - \mu_0}{\mu_0}$.
- p-value: For each non-control variation, the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the data, in favor of the alternative hypothesis, assuming the null hypothesis is true.
Mean or conversion rate and 90% confidence interval
The shape of the data-generating distribution for the unit-level metric values is, unfortunately, unknown. However, because we are interested in estimating the population mean $\mu_i$, we can simplify our analysis by appealing to the Central Limit Theorem. Under some regularity conditions, the sample mean is approximately normally distributed as:

$$\bar{Y}_i \sim \mathcal{N}\left(\mu_i, \frac{\sigma_i^2}{n_i}\right)$$
For practical purposes, since estimating the standard deviation is not the primary goal, we substitute the true standard deviation $\sigma_i$ with its sample estimate $\hat{\sigma}_i$. While this substitution technically results in a Student's t-distribution, with a sufficiently large sample size the t-distribution closely approximates a normal distribution.
We compute the 90% confidence interval as:

$$\bar{Y}_i \pm z_{0.95} \frac{\hat{\sigma}_i}{\sqrt{n_i}}$$

where $z_{0.95} \approx 1.645$ is the 95th percentile of the standard normal distribution.
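As an illustration, the sample mean and its two-sided 90% confidence interval can be computed with only the standard library. This is a minimal sketch with function names of our own choosing, not LaunchDarkly's implementation:

```python
import math

# 95th percentile of the standard normal distribution; putting 5% in each
# tail yields a two-sided 90% confidence interval
Z_95 = 1.6448536269514722

def mean_ci_90(values):
    """Sample mean and two-sided 90% confidence interval for one variation."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with the n - 1 denominator
    variance = sum((y - mean) ** 2 for y in values) / (n - 1)
    half_width = Z_95 * math.sqrt(variance / n)
    return mean, (mean - half_width, mean + half_width)
```

For a binary conversion metric, `values` would be 0/1 indicators and the returned mean is the conversion rate.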
Relative difference from control
When the sample size is large enough, you can approximate $\bar{Y}_i$ by a normal distribution, as justified by the Central Limit Theorem. Assuming independence between the non-control and control variations, we use the delta method to derive the confidence interval for the relative difference from control, $\hat{\Delta}_i = \frac{\bar{Y}_i - \bar{Y}_0}{\bar{Y}_0}$. This gives us:

$$\hat{\Delta}_i \sim \mathcal{N}\left(\frac{\mu_i - \mu_0}{\mu_0},\ \frac{\sigma_i^2}{\mu_0^2 n_i} + \frac{\mu_i^2 \sigma_0^2}{\mu_0^4 n_0}\right)$$
For practical purposes, as in the previous section, we use a plug-in estimate for the standard deviation of the relative difference from control:

$$\hat{\sigma}_{\Delta_i} = \sqrt{\frac{\hat{\sigma}_i^2}{\bar{Y}_0^2\, n_i} + \frac{\bar{Y}_i^2\, \hat{\sigma}_0^2}{\bar{Y}_0^4\, n_0}}$$
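The point estimate and its delta-method standard error can be sketched as follows, assuming independent treatment and control samples. The function name is illustrative, not part of any LaunchDarkly API:

```python
import math

def relative_diff_and_se(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Relative difference from control and its delta-method standard error.

    mean_t, sd_t, n_t: sample mean, sample SD, and size for a non-control variation.
    mean_c, sd_c, n_c: the same quantities for the control variation.
    """
    rel_diff = (mean_t - mean_c) / mean_c
    # Plug-in (delta-method) variance of the relative difference
    se = math.sqrt(
        sd_t ** 2 / (mean_c ** 2 * n_t)
        + mean_t ** 2 * sd_c ** 2 / (mean_c ** 4 * n_c)
    )
    return rel_diff, se
```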
Two-sided test
The two-sided 90% confidence interval for the relative difference from control is:

$$\hat{\Delta}_i \pm z_{0.95}\, \hat{\sigma}_{\Delta_i}$$
This two-sided 90% confidence interval provides a range within which the true relative difference is likely to fall with 90% confidence.
One-sided test
For a one-sided test, the 90% confidence interval for the relative difference from control depends on the success criteria of the metric. You define the success criteria when you create the metric.
The one-sided 90% confidence interval for the relative difference from control is computed as follows, depending on the metric's success criteria:
- Higher is better: $\left[\hat{\Delta}_i - z_{0.90}\, \hat{\sigma}_{\Delta_i},\ +\infty\right)$
- Lower is better: $\left(-\infty,\ \hat{\Delta}_i + z_{0.90}\, \hat{\sigma}_{\Delta_i}\right]$

Here, $z_{0.90} \approx 1.282$ is the 90th percentile of the standard normal distribution.
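These two cases can be captured in a short sketch. The names are our own for illustration:

```python
import math

Z_90 = 1.2815515655446004  # 90th percentile of the standard normal distribution

def one_sided_ci_90(rel_diff, se, higher_is_better):
    """One-sided 90% CI for the relative difference from control.

    rel_diff: point estimate of the relative difference.
    se: its delta-method standard error.
    """
    if higher_is_better:
        # Lower bound only; upper bound is unbounded
        return (rel_diff - Z_90 * se, math.inf)
    # Upper bound only; lower bound is unbounded
    return (-math.inf, rel_diff + Z_90 * se)
```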
P-value
The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. The test statistic is computed based on the observed relative difference from control:

$$Z_i = \frac{\hat{\Delta}_i}{\hat{\sigma}_{\Delta_i}}$$
The null hypothesis is defined differently for two-sided and one-sided tests, and the interpretation of the p-value depends on the type of test.
Two-sided test
For a two-sided test:
- Null Hypothesis: There is no difference between the non-control variation's mean and the control variation's mean.
- Alternative Hypothesis: There is a difference between the means.
The two-sided p-value is calculated as:

$$p = 2 \left(1 - \Phi\left(\left|Z_i\right|\right)\right)$$

Here, $\Phi$ denotes the cumulative distribution function (CDF) of a standard normal distribution.
The practical interpretation is that a low two-sided p-value means there is strong evidence that the non-control variation's performance differs from the control. A high p-value means the data does not provide strong evidence of any difference.
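The two-sided p-value needs only the standard normal CDF, which the standard library exposes through the error function. A minimal sketch with illustrative names:

```python
import math

def normal_cdf(x):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided_p_value(rel_diff, se):
    """Two-sided p-value for the relative difference from control."""
    z = rel_diff / se
    return 2.0 * (1.0 - normal_cdf(abs(z)))
```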
One-sided test
For a one-sided test:
- Null Hypothesis: The non-control variation is performing worse than or equal to the control variation on average.
- Alternative Hypothesis: The non-control variation is performing better than the control on average.
The one-sided p-value depends on the metric's success criteria:
- Higher is better: $p = 1 - \Phi(Z_i)$
- Lower is better: $p = \Phi(Z_i)$

Here, $\Phi$ denotes the CDF of a standard normal distribution.
The practical interpretation is that a low one-sided p-value indicates strong evidence that the non-control variation outperforms the control variation. A high p-value suggests insufficient evidence to claim superiority of the non-control variation relative to the control.
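The one-sided p-value follows the same pattern, with the tail direction chosen by the metric's success criteria. A sketch with illustrative names:

```python
import math

def normal_cdf(x):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sided_p_value(rel_diff, se, higher_is_better):
    """One-sided p-value for the relative difference from control."""
    z = rel_diff / se
    # "Higher is better" tests the upper tail; "Lower is better" the lower tail
    return 1.0 - normal_cdf(z) if higher_is_better else normal_cdf(z)
```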
Statistical significance
Statistical significance indicates the likelihood that the observed relationship or effect in the data is not due to random chance. When a test result is statistically significant, it means the observed effect is unlikely to have occurred purely by random variation.
A p-value at or below the experiment-specified significance level implies the result is statistically significant. The practical interpretation of a statistically significant result depends on the type of test:
- Two-sided Test: A statistically significant result implies that the difference in performance between the control and non-control variations is unlikely to have occurred by chance. However, a two-sided p-value does not indicate the direction of the difference, either better or worse. To interpret the result, we rely on the sign of the relative difference from control:
- Desired direction: If the sign of the relative difference aligns with the metric's success criteria, it indicates that the non-control variation is performing better than the control variation. Specifically, the relative difference from control is positive when the metric's success criteria is "Higher is better" and negative when it is "Lower is better." In such cases, we refer to the significant result as being in the "desired direction."
- Undesired direction: If the sign of the relative difference is opposite to the success criteria, it suggests that the non-control variation is performing worse than the control variation.
- One-sided Test: A statistically significant result indicates that the non-control variation's better performance compared to the control variation is unlikely to be due to random chance.
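The two-sided decision logic above can be summarized in a short sketch. The labels and function name are our own, not LaunchDarkly's terminology:

```python
def classify_two_sided(p_value, rel_diff, alpha, higher_is_better):
    """Classify a two-sided test result by significance and direction.

    alpha: the experiment-specified significance level, e.g. 0.05 or 0.10.
    """
    if p_value > alpha:
        return "not significant"
    # The sign of the relative difference, compared against the success
    # criteria, determines the direction of a significant result
    desired = (rel_diff > 0) if higher_is_better else (rel_diff < 0)
    return "significant, desired direction" if desired else "significant, undesired direction"
```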
Conclusion
This guide explained the statistical methods LaunchDarkly applies to frequentist experiments. To learn about Bayesian statistical methods in LaunchDarkly, read Experimentation statistical methodology for Bayesian experiments.