Designing experiments in LaunchDarkly
Read time: 7 minutes
Last edited: Aug 07, 2020
This guide explains how LaunchDarkly performs experimentation and gives you strategies to design an effective experiment.
Everything we do to our digital properties should be purposeful and provide value to the businesses we support and the people who use our software. How do we know that the decisions we make are valuable?
Experimentation provides the means for us to test user research and ideas from anywhere. It lets us gather data and feedback from our users while minimizing risk. Feature flags control who sees which variation of a feature, and if something goes wrong, you can toggle a flag to turn the feature off. This allows you to run experiments in production on your existing user base.
The guide covers the following topics:
- The purpose and types of experiments
- The necessary components of an experiment
- The elements of experiment design
To complete this guide, you must have the following prerequisites:
- A basic understanding of feature flags
- A basic understanding of your business's needs or Key Performance Indicators (KPIs)
- A basic understanding of metrics and how LaunchDarkly uses them
To enable experimentation in your account, contact your LaunchDarkly account manager or email email@example.com.
It’s not critical to understand these concepts completely, but some awareness of what they are will make this process simpler. Don’t worry if some of these items are complicated. Experimentation is a scientific discipline that takes time to learn and understand.
You should understand these concepts before you read this guide:
Experiments, also known as hypothesis, split, or A/B testing, help us learn and prove what we've learned to others. You can use experiments to gather supplemental data to confirm or refine your ideas.
In LaunchDarkly, experiments can:
- validate new ideas by testing multiple variations of a feature,
- determine your user base's appetite for a feature before you build it out,
- gather performance data for a feature, service, or API,
- increase the adoption of a product,
- drive revenue and conversion rate, and more.
You can turn any feature flag into an experiment by connecting it to a metric you want to track. Because you can wrap any part of your technology stack or product in a feature flag, you can use experiments to test for much more than the efficacy of UI changes.
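The relationship between a flag and a metric can be sketched in a few lines. This is a minimal illustration only, not the LaunchDarkly SDK or its bucketing algorithm: the `ROLLOUT` table, the flag key, and the helper names `assign_variation`, `record_exposure`, and `track_conversion` are all hypothetical. It assumes each user is hashed into a stable bucket, so the same user always sees the same variation, and that metric events are counted per variation.

```python
import hashlib
from collections import Counter

# Hypothetical rollout: variation name -> share of traffic (shares sum to 1.0).
ROLLOUT = {"control": 0.5, "new-checkout": 0.5}


def assign_variation(flag_key, user_key, rollout):
    """Deterministically map a user to a variation using a hash bucket."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    cumulative = 0.0
    for variation, share in rollout.items():
        cumulative += share
        if bucket < cumulative:
            return variation
    return list(rollout)[-1]  # guards against floating-point rounding


exposures = Counter()    # users who were served each variation
conversions = Counter()  # metric events recorded for each variation


def record_exposure(flag_key, user_key):
    """Serve a variation and count the user as exposed to it."""
    variation = assign_variation(flag_key, user_key, ROLLOUT)
    exposures[variation] += 1
    return variation


def track_conversion(flag_key, user_key):
    """Attribute a metric event to the variation the user was served."""
    conversions[assign_variation(flag_key, user_key, ROLLOUT)] += 1
```

Because assignment depends only on the flag key and user key, a conversion recorded later is attributed to the same variation the user originally saw.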
A hypothesis is a testable statement that an experiment can confirm or refute. A well-constructed hypothesis has both a positive and a negative result defined.
For example, you may hypothesize that changing the button position to the top-right corner will increase click-through by at least 2%, and leaving the button in its current position will not cause any changes in click-through.
To learn more about writing effective hypotheses, read Formulating a hypothesis.
Experiments can test many types of hypotheses.
Here are some examples:
A/B/n testing: Users are given two or more versions of a page to determine which one performs better. Metrics related to engagement, conversion, performance, or other success criteria determine the winning version.
Painted door tests: You create a minimum prototype of a feature to determine if there is really demand for it, who it appeals to, or if you’ve correctly identified your users' problem before trying to solve it. This is the experimentation version of a smoke test, because you do not build the feature out completely. There's only a minimal framework for the feature and an entry point for the user to give feedback.
Champion/challenger: You test extremely different versions of a feature or product against a control simultaneously. It is important that the different versions are as dissimilar to each other as possible in design and function. This lets you determine user response to different prototypes quickly, at the expense of knowing exactly which aspects of a prototype users prefer or dislike.
Chaos engineering: You run experiments in your environments to learn how systems respond to turbulent and unexpected conditions. These experiments might include shutting off servers, services, or connections in a rapidly-reversible way.
It is critical to plan out and document experiments before you run them. Experiment design documents contain the definition of why you are running this test and decisions you want to make based on its outcomes.
There are many tools to help you manage an experiment's roadmap. Jira, Trello, Asana, and Excel or Google Sheets are all great candidates for storing and logging information on experiments.
Your roadmap should contain information on the following:
- Test ID
- Experiment name
- Experiment description
- Audience target. This is information on the ideal target audience and the logic you use to identify them.
- Variations. This is how many flag variations this experiment uses, and what percentage of traffic each is assigned.
- Sample size. This is how much traffic you need before you can check your results and determine an outcome.
- Duration. This is how long your experiment runs. This number is based on sample size.
- Go-live date. This is the date your experiment starts.
- Status. Keep track of whether your experiment is still being drafted, is running, is being analyzed after collecting data, or is complete.
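If you keep your roadmap in code or export it from a spreadsheet, the fields above map naturally onto a small record type. The following is one possible sketch; the class name `RoadmapEntry`, the field names, and the sample values are illustrative assumptions rather than a LaunchDarkly format.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Dict


class Status(Enum):
    DRAFT = "draft"
    RUNNING = "running"
    ANALYZING = "analyzing"
    COMPLETE = "complete"


@dataclass
class RoadmapEntry:
    test_id: str
    name: str
    description: str
    audience_target: str
    variations: Dict[str, float]  # variation name -> percentage of traffic
    sample_size: int              # traffic needed before checking results
    duration_days: int            # derived from the sample size
    go_live: date
    status: Status = Status.DRAFT

    def __post_init__(self):
        # Traffic percentages across all variations should account for 100%.
        if abs(sum(self.variations.values()) - 100) > 1e-9:
            raise ValueError("variation percentages must sum to 100")

    @property
    def end_date(self) -> date:
        """Planned end date, based on the go-live date and duration."""
        return self.go_live + timedelta(days=self.duration_days)


# Hypothetical entry for illustration.
entry = RoadmapEntry(
    test_id="EXP-001",
    name="Button position",
    description="Move the button to the top-right corner",
    audience_target="All logged-in users",
    variations={"control": 50, "top-right": 50},
    sample_size=8000,
    duration_days=14,
    go_live=date(2020, 9, 15),
)
```

Validating the traffic split when the record is created catches a common planning mistake before the experiment goes live.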
When you plan your experiment, it can be helpful to describe it with a brief, simple explanation of what you are trying to prove and why. You may also want to explain where the rationale behind this experiment originates. References to any user research, prior bugs, or feature requests provide context for what you are trying to achieve.
Here are some example questions you can answer with an experiment:
- Does removing complexity and adding white space result in users spending more time on the site?
- Do more product images lead to increased sales?
- Does page load time increase significantly when search results are sorted?
- What is the best color for a call to action button?
- What is the best location on the page for the button?
Remember that you can approach a solution from many perspectives. That means many hypotheses per idea and, potentially, multiple experiments.
The foundation of an experiment is a question or a hypothesis you want answered. A hypothesis should be specific and seek to answer a single question. You may need to run multiple experiments to answer numerous questions about a feature.
Here are two example hypotheses:
- We believe that by rewriting our results page in React, we will reduce page load time and increase utilization by X%.
- We believe that by adding a chatbot to the second page of our account registration workflow, we will decrease incomplete signups by 15% and have a minimum 1% increase in signup completion.
Ensuring that your hypothesis is robust and as discrete as possible is essential to get useful data from your experiment. An imprecise hypothesis can allow more subjective interpretation of the results.
A great hypothesis has the following considerations:
- Specific: The more specific you are in your expectations, the easier it is to determine whether you have a real effect and what you need to do next.
- Rigorous: The hypothesis must have solid metrics to work towards, and you must review them regularly.
- Multiplicative: A great hypothesis lets you generate further hypotheses. You should be able to build your next hypothesis from the results of the current one. It should generate value no matter what the results of the experiment are.
Most hypotheses can be reduced to the following statement:
"By doing [X] to [Y], we expect [Z]."
While you assess an experiment's viability, you must consider how many unique users it takes to get a representative sample of your audience. A sample size calculation not only tells you how many users you need but, done properly, also indicates how long the experiment needs to run, the estimated impact of the experiment, and other key information you need to prioritize your test.
The majority of statistical models in experimentation don’t indicate when you should stop gathering data. Just because you have statistically significant results doesn't mean your experiment needs to end. By determining a sample size before you begin your experiment, you have a clear indicator of when to stop your experiment.
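As a rough illustration of a sample size calculation, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. It is a generic textbook method, not necessarily the one LaunchDarkly uses, and the function names, the default 5% significance level, and the 80% power are assumptions you should adjust to your own needs.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, min_detectable_lift,
                              alpha=0.05, power=0.80):
    """Users needed in each variation for a two-sided two-proportion test.

    min_detectable_lift is absolute: 0.02 means detecting 10% -> 12%.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def duration_days(n_per_variation, variation_count, daily_users):
    """Days needed to reach the sample size at a steady daily traffic level."""
    return math.ceil(n_per_variation * variation_count / daily_users)


# Hypothetical inputs: 10% baseline click-through, 2-point absolute lift.
n = sample_size_per_variation(0.10, 0.02)
days = duration_days(n, variation_count=2, daily_users=1000)
```

Smaller expected effects require substantially more users, which is why it helps to decide on the minimum effect you care about before the experiment starts.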
Statistical significance tells you when the difference a variation produces is unlikely to be the result of chance alone.
To learn more about statistical significance, watch this brief video:
Statistical significance can vary over the course of an experiment. You may choose to run the experiment for slightly longer to see if that changes the results.
When a flag variation reaches statistical significance in an experiment, that variation is called out in the LaunchDarkly UI.
To learn more about how LaunchDarkly displays experiment data, read Interpreting experiment data.
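To make the idea of statistical significance concrete, here is a minimal sketch of a two-sided two-proportion z-test. It is a generic frequentist check offered for illustration; it is not a description of how LaunchDarkly computes the results it displays, and the function name and the sample counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in outcomes, so no detectable difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: conversions and users for control and one variation.
p_value = two_proportion_p_value(400, 4000, 480, 4000)
significant = p_value < 0.05  # 0.05 is a common, but arbitrary, threshold
```

Checking this value repeatedly while data is still arriving inflates the chance of a false positive, which is one reason to fix the sample size in advance.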
A good experiment requires well-defined metrics.
Use these best practices to determine if your metrics meet your needs:
Identifying the right metrics is imperative to getting accurate results from your experiments.
Choosing metrics that correctly measure the effect of a change on your users or codebase can be difficult. Where possible, choose metrics that are a direct result of the changes you are making, rather than those that might be influenced by other factors.
For example, if you know your business's Key Performance Indicators (KPIs), you may be able to break them down into smaller numeric goals, such as an item's average revenue per order, a server's response speed, or a link's click-through rate. These goals might make useful metrics to track in an experiment.
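As an illustration of breaking a KPI into smaller numeric goals, the sketch below derives a click-through rate and an average revenue per order for each variation from a raw event log. The event names, the tuple layout, and the `summarize` helper are hypothetical and do not reflect a LaunchDarkly data format.

```python
from collections import defaultdict

# Hypothetical event log: (variation, event name, numeric value).
events = [
    ("control", "impression", 0), ("control", "impression", 0),
    ("control", "click", 0), ("control", "order", 40.0),
    ("variant", "impression", 0), ("variant", "impression", 0),
    ("variant", "click", 0), ("variant", "click", 0),
    ("variant", "order", 30.0), ("variant", "order", 50.0),
]


def summarize(event_log):
    """Compute per-variation click-through rate and revenue per order."""
    totals = defaultdict(
        lambda: {"impressions": 0, "clicks": 0, "orders": 0, "revenue": 0.0}
    )
    for variation, name, value in event_log:
        row = totals[variation]
        if name == "impression":
            row["impressions"] += 1
        elif name == "click":
            row["clicks"] += 1
        elif name == "order":
            row["orders"] += 1
            row["revenue"] += value
    return {
        variation: {
            "click_through_rate": (
                row["clicks"] / row["impressions"] if row["impressions"] else 0.0
            ),
            "avg_revenue_per_order": (
                row["revenue"] / row["orders"] if row["orders"] else 0.0
            ),
        }
        for variation, row in totals.items()
    }


summary = summarize(events)
```

Each value in the summary is a direct result of user behavior in one variation, which makes it a reasonable candidate for an experiment metric.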
With LaunchDarkly experiments, you can add as many metrics to an experiment as you need. More useful metrics means a higher likelihood of discovering something insightful.
We recommend between four and seven metrics per experiment. With a properly designed experiment, you should be able to make an initial judgment of that many metrics' success quickly. More metrics provide better barometers for success, but as the quantity of measures increases, so does the complexity of making a decision.
To learn more about setting metrics for experiments in LaunchDarkly, read Creating experiments.
One of the advantages of SDK-based experimentation is that your engineering team can almost immediately roll out your successful experiments to the appropriate audiences.
After your experiment achieves statistical significance, analyze the results to identify anomalies or inconsistencies. Verify that all cohorts received a similar number of evaluations, or your results may be biased.
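One way to check that cohorts received a similar number of evaluations is a sample ratio mismatch test. The sketch below handles the two-cohort case with a chi-square goodness-of-fit test; the function name, the sample counts, and the strict 0.001 threshold are illustrative assumptions, not LaunchDarkly behavior.

```python
import math


def sample_ratio_p_value(count_a, count_b, expected_share_a=0.5):
    """P-value that two cohort sizes match the intended traffic split."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # For one degree of freedom, the chi-square tail probability
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical evaluation counts for a 50/50 split.
p_value = sample_ratio_p_value(5000, 5400)
mismatch = p_value < 0.001  # a strict threshold limits false alarms
```

A very small p-value suggests the split you observed is unlikely under the split you intended, so the assignment or the event pipeline deserves a closer look before you trust the results.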
Be prepared to change the flag variations that drive an experiment in order to measure different data. Iterations of an experiment build on the value you have generated or allow you to pivot and investigate further.
In this guide you have learned some of the key concepts of experimentation and how to design an experiment using feature flags.
To complete an exercise in which you create an experiment and assess its results, read Experimenting with feature flags.