Analyzing experiment data
Read time: 4 minutes
Last edited: May 20, 2022
This topic explains how to read an experiment's results in the flag's Experimentation tab and apply its findings to your product.
When your experiments are running, you can view information about them on the Experimentation tab for the flags connected to them. The Experimentation tab displays all the experiments a flag is participating in, including both experiments that are currently recording and experiments that are stopped.
Each experiment shows details about what data the experiment is collecting and what that data means. You can view information about the experiment, like its name and experiment type, in the section header.
Here is an image of an experiment:
Here are some things you can do with each experiment:
- Stop the experiment or start a new iteration. To learn more, read The experiment lifecycle.
- Edit metrics connected to the experiment. Edit experiment metrics to start new experiments using this flag.
- Remove the metric from the experiment:
- Visualize experiment data over a set period of time. Click the Iteration menu to select a time frame:
The data an experiment has collected is represented in a table.
Here is an image of numeric experiment's data:
Experiment data differs based on whether you're running a numeric or conversion (click, page view, or custom) experiment.
Not all columns appear for each experiment. Some are unique to numeric or conversion experiments.
The list below explains what each column means:
- Variation: Which flag variation's data is shown in the row.
- Conversions / unique users: How many unique users took action based on the variation they saw, relative to the total number of unique users who viewed the page. For click experiments, this means the denominator is the number of views matching the URL specified in the click metric UI.
- Conversion rate: The percentage of users who took action based on the variation they saw, relative to the total number of unique users who encountered that flag variation.
- Total events: The number of times the metric connected to this flag variation was evaluated. The total events column only updates when the metric associated with the flag variation registers an event. This is not the total number of times the flag variation has ever been evaluated.
- Average: The average numeric value this flag variation returned.
- Confidence interval: A range of values between which the actual value for the conversion rate likely falls. The smaller the confidence interval, the more confident you can be about the prediction. For example, a confidence interval of 11%-13% is more reliable than a confidence interval of 10%-30%.
- Change: The difference, in positive or negative values, the flag variation has from the baseline variation. Baseline flag variations say Baseline rather than showing a value.
- P-value: The likelihood that differences in user behavior within an experiment are random, rather than a result of receiving different variations of a flag. LaunchDarkly uses the industry standard of a p-value of .05 or lower to consider results statistically significant. The lower the p-value, the more likely it is that the results of your experiment are statistically significant. For example, a p-value of .03 means there's a 97% chance differences in user behavior are due to the flag variations you tested, and there's a 3% chance any differences in user behavior were random.
LaunchDarkly uses variation calls to correlate conversions and unique user impressions within a 24-hour window. If there are more than 24 hours between the time of a variation call and a track call on a user key, then LaunchDarkly does not register them in the experiment.
If you want to correlate conversions and unique user impressions that occur more than 24 hours apart, call variation again so LaunchDarkly registers them as part of the experiment.
If a user receives multiple variations before a conversion, the conversion is attributed to the variation the user saw most recently.
After your experiment has registered enough events to achieve statistical significance compared to the baseline flag variation, LaunchDarkly declares a winning variation. A winning variation appears when that variation reaches 95% statistical significance.
If you run an experiment over an extended period of time, the winning variation may change. Review your experiment data regularly to make sure you're making the most informed choices about your business and product.
You can view the statistical significance of your metrics, and the winning versions, in the "Statistical significance" column. The baseline variation does not show statistical significance.
Here is an image of an experiment with a winning variation designated:
Your winning variation is the flag variation that has the most positive impact compared to the baseline. If you're doing multivariate testing, which means comparing multiple different flag variations to the baseline simultaneously, you may have multiple winning variations. If you do, you can choose which variation best meets your needs and roll that flag out to your entire user base.
After you determine a winning variation, you can roll the winning variation out to 100% of your users from the flag's targeting page. To learn more, read Percentage rollouts.
If you're done with an experiment and have rolled the winning variation to your user base, it might be a good time to stop your experiment. Experiments on a user base that only receives one flag variation do not return useful results. Stopping an experiment retains all the data collected so far. To learn more, read Managing experiments.
If you're using Data Export, you can find experiment data in your Data Export destinations to further analyze it using third-party tools of your own.
To learn more, read Data Export.