Risky Procedures in Experiments
Covers common risky procedures during experiment design, execution, and analysis.
Design-phase risks
1. Starting an experiment without a clear hypothesis
What it is
- Launching an experiment on intuition or a vague idea, such as "let's try a new button color" or "optimize the homepage"
- Not pre-defining a specific, quantifiable hypothesis — for example, missing a clear statement like "Changing the CTA from gray to orange is expected to increase mobile click-through rate by ≥10%"
Why it's harmful
- The data collected cannot answer any question — without a hypothesis there is no decision criterion; the result, however it turns out, yields no clear conclusion
- Leads to "shotgun experiments" — choosing whichever metric looks favorable is equivalent to hunting for significant results in noise
- Too many primary metrics — unclear hypotheses often lead to registering many "just-check" metrics; the more you check, the higher the false positive rate
- Wastes traffic and time — occupies user traffic but produces no actionable conclusions
Typical scenario: A product manager says "I have a feeling this feature will do well" and starts an experiment. Two weeks later the data science team is asked "what's the conclusion?" — no one can answer, because no one defined what "well" means.
Solutions
- Write down the hypothesis before the experiment, covering three elements:
- Who (which users or segment will be affected)
- What (which metric)
- How much (direction and magnitude)
- Limit the number of primary metrics to those directly tied to the hypothesis
- Establish an experiment review process — experiments with unclear hypotheses should not be launched
- Document explicitly in the experiment brief: Hypothesis → primary metrics → guardrail metrics → decision rules
Good hypothesis example: "Changing the checkout-page CTA from 'Continue' to 'Buy Now' is expected to increase desktop checkout conversion by ≥5%, while cart abandonment rate does not increase by more than 1%."
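One way to make the brief reviewable is to capture it as structured data. The sketch below is illustrative only: the `ExperimentBrief` class, its field names, and the `is_launch_ready` check are hypothetical and not part of any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentBrief:
    """Hypothetical container for the parts of an experiment brief."""
    population: str        # who: which users are affected
    primary_metric: str    # what: the metric the hypothesis is about
    expected_change: str   # how much: direction and magnitude
    guardrail_metrics: list[str] = field(default_factory=list)
    decision_rule: str = ""

    def is_launch_ready(self) -> bool:
        # A brief with any empty element should not pass review.
        required = [self.population, self.primary_metric,
                    self.expected_change, self.decision_rule]
        return all(s.strip() for s in required) and bool(self.guardrail_metrics)


brief = ExperimentBrief(
    population="desktop users on the checkout page",
    primary_metric="checkout conversion",
    expected_change="increase by >= 5%",
    guardrail_metrics=["cart abandonment rate"],
    decision_rule="ship if conversion rises >= 5% and abandonment rises <= 1%",
)
# A vague idea ("optimize the homepage") leaves most fields empty.
vague = ExperimentBrief(population="", primary_metric="homepage", expected_change="")
```

Here `brief.is_launch_ready()` passes while `vague.is_launch_ready()` does not, which mirrors the review rule that experiments with unclear hypotheses should not be launched.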
2. Wrong randomization unit
What it is
- Mismatch between randomization unit and analysis unit (e.g., assigning by device but analyzing by user)
Why it's harmful
- Device ID issues:
- One user accessing from multiple devices → assigned to different treatment groups → inconsistent experience
- Multiple people sharing a device → behavior incorrectly attributed to one individual → contaminated data
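- Unit mismatch: observations that belong to the same randomization unit are correlated, so an analysis that treats them as independent calculates standard errors incorrectly and produces unreliable p-values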
Solutions
- User ID is the best default choice — persistent across devices, ensures consistency
- Ensure randomization unit = analysis unit
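A common way to randomize by user ID is deterministic hashing, so the same user gets the same group on every device. The following is a minimal sketch under that assumption; `assign_group`, `experiment_key`, and `treatment_share` are hypothetical names, and the document does not state which assignment scheme the platform actually uses.

```python
import hashlib


def assign_group(user_id: str, experiment_key: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    # Salting with the experiment key keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same group, whatever device they use.
group_on_phone = assign_group("user-123", "checkout-cta")
group_on_laptop = assign_group("user-123", "checkout-cta")
```

Because the group depends only on the user ID and the experiment key, the analysis can then aggregate by user ID and keep the randomization unit equal to the analysis unit.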
3. Ignoring guardrail metrics and side effects
What it is
- Focusing only on the target metric (e.g., click-through rate) while ignoring other key business metrics
- Not setting guardrail metrics — such as crash rate, page load time, revenue
- Not considering the indirect impact of feature changes on other product areas
Why it's harmful
- Local optimization, global degradation — click-through rate improves but revenue drops (e.g., from misleading copy)
- User retention suffers — short-term metrics look good but long-term retention declines
- Technical risks go unmonitored — a new feature causes crash rate to spike with no one watching
Solutions
- Every experiment must include:
- Target metrics (what you want to improve)
- Guardrail metrics (what must not get worse, e.g., crash rate, revenue, retention)
- Exploratory metrics (metrics that might be affected, for discovering unexpected effects)
- Pre-define decision rules in the experiment design: define at what threshold of guardrail metric degradation the experiment must be stopped
- Actively check guardrail metrics after launch, not just target metrics
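A pre-defined stop rule can be written down as a simple threshold check. This is a sketch only: `breached_guardrails`, the metric names, and the limits are made up for illustration, and it compares point estimates without accounting for statistical noise.

```python
def breached_guardrails(relative_changes: dict[str, float],
                        max_degradation: dict[str, float]) -> list[str]:
    """Return guardrail metrics whose degradation exceeds the pre-defined limit.

    relative_changes: observed treatment-vs-control change, signed so that a
        positive number means "got worse" (e.g. 0.02 = 2% more crashes).
    max_degradation: largest tolerated worsening per metric, fixed before launch.
    """
    return [name for name, limit in max_degradation.items()
            if relative_changes.get(name, 0.0) > limit]


# Hypothetical limits agreed in the experiment design.
limits = {"crash_rate": 0.01, "page_load_time": 0.05, "revenue_loss": 0.02}
# Hypothetical observations during the experiment.
observed = {"crash_rate": 0.03, "page_load_time": 0.01, "revenue_loss": 0.00}
to_stop = breached_guardrails(observed, limits)
```

With these example numbers only `crash_rate` exceeds its limit. A real check would usually test whether the degradation is statistically distinguishable from noise before stopping.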
Execution-phase risks
1. Peeking — making decisions before the target sample size is reached
What it is
- The experiment is scheduled for two weeks, but you stop it after three days because the results look "significant"
- Continuously monitoring the experiment dashboard and declaring victory as soon as the p-value drops below 0.05
- Extending when results look bad, stopping early when results look good — in essence, selectively harvesting significant results
- Looking at results multiple times but only reporting the "best" one
Why it's harmful
- False positive rate inflates significantly — every peek is an implicit hypothesis test; multiple peeks push the actual α well above 0.05
- Random fluctuation in p-values — during an experiment, p-values can oscillate between significant and non-significant; any single snapshot does not represent the final result
Key insight: The problem with peeking is "optional stopping" — you are not choosing a random point in time to read results; you are stopping because the results look significant. This is no different in principle from flipping a coin repeatedly and only reporting the round where all flips came up heads.
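The inflation can be seen in a small A/A simulation, where both groups are drawn from the same distribution and any "significant" result is a false positive. This is a sketch under simplifying assumptions (normally distributed metric with known unit variance, ten equally spaced peeks, a two-sided z-test); the function name and defaults are illustrative.

```python
import math
import random


def simulate_aa_false_positives(n_sims=1000, n_per_arm=500, n_peeks=10, seed=0):
    """Return (rate with one final look, rate when any peek counts as a win)."""
    rng = random.Random(seed)
    step = n_per_arm // n_peeks
    z_crit = 1.96  # two-sided 5% threshold for a z-test
    final_hits = 0
    peek_hits = 0
    for _sim in range(n_sims):
        diff_sum = 0.0  # running sum of (treatment - control) observations
        any_significant = False
        z = 0.0
        for peek in range(1, n_peeks + 1):
            for _ in range(step):
                # Both arms draw from the same N(0, 1): there is no true effect.
                diff_sum += rng.gauss(0, 1) - rng.gauss(0, 1)
            n = peek * step
            z = diff_sum / math.sqrt(2 * n)  # z-statistic with known unit variance
            if abs(z) > z_crit:
                any_significant = True
        final_hits += abs(z) > z_crit  # decision made only at the planned end
        peek_hits += any_significant   # decision made at the first "lucky" peek
    return final_hits / n_sims, peek_hits / n_sims


final_rate, peeking_rate = simulate_aa_false_positives()
```

In runs of this sketch, `final_rate` stays near the nominal 5% while `peeking_rate` comes out well above it, even though no real effect exists in either case.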
Solutions
- Option 1: Adhere strictly to a fixed-sample design
- Pre-calculate required sample size and run duration
- Do not look at results before the target is reached (or if you do, make no decisions based on them)
- Conduct one final analysis once the target is reached
- Option 2: Establish strict monitoring rules
- For day-to-day monitoring, watch only guardrail metrics (to catch serious problems) — do not check the significance of primary metrics
- Check the significance of primary metrics only at pre-specified analysis time points
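For Option 1, the required sample size can be estimated before launch. The sketch below uses the standard normal-approximation formula for a two-sided comparison of two conversion rates; the function name, the example baseline, and the daily traffic figure are assumptions for illustration, not values from this document.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    min_detectable_lift is absolute (0.01 means one percentage point).
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_detectable_lift ** 2)


n = sample_size_per_arm(baseline_rate=0.10, min_detectable_lift=0.01)
# Run duration, assuming a hypothetical 5,000 eligible users per day across both arms.
days = math.ceil(2 * n / 5000)
```

Fixing `n` and `days` in the experiment design is what makes "the target" unambiguous, so the final analysis happens once and at a time chosen before any data was seen.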
2. SRM (Sample Ratio Mismatch) goes undetected
What it is
- The experiment is configured as 50/50, but actual data shows a 60/40 or similar imbalance
- No SRM check is run, or an SRM alert fires but is ignored
Why it's harmful
- All analytical conclusions are untrustworthy — SRM means randomization has been broken; the treatment group and control group are no longer comparable
- Direction of bias is unpredictable — the true effect may be over- or under-estimated
- The most hidden problem — the data looks "normal," metrics show significant differences, but the root cause is data collection bias, not a product effect
Solutions
- Every experiment must run an SRM check as a required validation step before any result is interpreted (see Sample Ratio Mismatch under the statistics engine)
- When SRM is detected, fix the root cause and restart the experiment — using contaminated data for decisions is not recommended
- Common fixes: fix crash bugs, adjust the exposure trigger point, correct assignment logic
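An SRM check is usually a chi-square goodness-of-fit test of the observed group sizes against the configured split. The sketch below implements it with the standard library only; the function names and the 0.001 alert threshold are assumptions, and the platform's own check (referenced above) may differ.

```python
import math


def srm_p_value(control_n: int, treatment_n: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = control_n + treatment_n
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_n - expected_treatment) ** 2 / expected_treatment
            + (control_n - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_n, treatment_n, expected_treatment_share=0.5, threshold=0.001):
    # threshold=0.001 is an assumed convention; use whatever your platform specifies.
    return srm_p_value(control_n, treatment_n, expected_treatment_share) < threshold


small_imbalance = has_srm(5050, 4950)   # within normal random variation
large_imbalance = has_srm(6000, 4000)   # the 60/40 case described above
```

With 10,000 users, a 5,050 / 4,950 split is unremarkable, while a 6,000 / 4,000 split is flagged.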
3. Code bugs during execution introduce confounding variables
What it is
- The experiment feature code has a bug, causing the treatment group user experience to differ from what was intended
- The experiment code itself introduces performance issues (slower loading, rendering anomalies)
- The experiment assignment logic is wrong; some users are not actually placed in the correct group
- Instrumentation code errors cause events to be dropped or double-counted
Why it's harmful
- You are testing something other than what you intended — you think you are measuring the new feature's effect; you are actually measuring the effect of the bug
- Performance issues mask the real effect — the feature itself may be good, but 2 extra seconds of loading time degrades the entire user experience
- Assignment errors contaminate the comparison and can cause SRM: some "treatment group" users are actually seeing the control experience
Solutions
- Test thoroughly before launch, simulating real user flows
- Use automated integration tests to validate experiment functionality
- In the early launch phase, start with a small allocation (e.g., 1%) and confirm there are no critical bugs before ramping up
- Monitor technical metrics such as crash rate, load time, and error rate in real time
4. Incorrect exposure trigger point
What it is
- The treatment group triggers exposure deep in the page; the control group triggers it at the top
- Exposure timing is misaligned with the user's actual experience
- The exposure event is reported before the user has actually seen the experiment feature
- Under certain conditions the exposure event fails silently (e.g., on poor network connections)
Why it's harmful
- Introduces selection bias — the control group includes all visitors; the treatment group includes only users who "successfully reached" the feature; the two groups are no longer comparable
- Metric calculation errors — inconsistent denominators make conversion rates and other metrics incomparable
- Causes SRM — the two groups report inconsistent exposure counts
Typical scenario: The new feature in the treatment group is at the bottom of the page; users must scroll to see it. Exposure fires when the feature renders, while the control group's exposure fires when the page loads. Result: the control group's "exposed users" include everyone; the treatment group includes only users who scrolled to the bottom — who are already more engaged users.
Solutions
- Align the exposure trigger timing — both groups' exposures should fire at the same user-behavior event
- If the experiment feature requires user action to activate, the control group should also have a corresponding "virtual exposure" event
- Ensure exposure events are reliably reported; add failure-retry logic
- Regularly compare observed exposure counts against expected allocations to catch losses early
5. Reducing traffic during the experiment
Reducing traffic partway through an experiment introduces several risks that can compromise the validity and reliability of the results:
- Reduced statistical power — Lower traffic can lead to smaller sample sizes, which in turn reduces the statistical power of the experiment. This makes it more difficult to detect significant differences between the control and treatment groups, increasing the likelihood of Type II errors (false negatives).
- Effect dilution — ABC accumulates user experiment data from the time of the user's first exposure. After reducing the experiment traffic, some users will no longer be influenced by the new product features but will still be counted in the experiment group, diluting the experiment effect.
- Bias introduction — If traffic reduction is not uniformly applied across all groups, it can introduce selection bias. This can result in an unrepresentative sample that skews the results and undermines the experiment's validity.
- Inconsistent user experience — Reducing traffic might lead to inconsistent user experiences, particularly if the reduction is not managed carefully. For instance, users who have already been exposed to the new product features may experience a degradation in user experience, which can affect user behavior and introduce additional variability into the results.
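The first two risks can be put in rough numbers. The sketch below assumes that standard errors scale with 1/sqrt(n) and that users who lose the feature behave like control users afterwards; both function names and the example values are hypothetical.

```python
import math


def mde_inflation(traffic_fraction: float) -> float:
    """Factor by which the minimum detectable effect grows when the
    sample shrinks to traffic_fraction of the planned size."""
    if not 0.0 < traffic_fraction <= 1.0:
        raise ValueError("traffic_fraction must be in (0, 1]")
    return 1 / math.sqrt(traffic_fraction)


def diluted_lift(true_lift: float, share_still_treated: float) -> float:
    """Measured lift when only part of the counted treatment group
    still receives the feature (the rest behave like control)."""
    if not 0.0 <= share_still_treated <= 1.0:
        raise ValueError("share_still_treated must be between 0 and 1")
    return true_lift * share_still_treated


inflation = mde_inflation(0.5)               # sample size cut in half
observed = diluted_lift(0.04, 0.5)           # half of counted users lost the feature
```

Under these assumptions, halving the sample raises the smallest detectable effect by about 41%, and a true lift of 4 percentage points would be measured as roughly 2.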
6. Changing parameters during the experiment
Changing experiment parameters partway through an online controlled experiment introduces several risks that can compromise its integrity and validity:
- Bias introduction — Participants who were exposed to different sets of parameters might become aware of the feature changes, which can influence their behavior and responses, leading to unknown bias.
- Inconsistent user experience — Changing parameters might lead to inconsistent user experience, which can affect user behavior and introduce additional variability into the results.
- Interpretation challenges — Changes in parameters can lead to ambiguous results that are difficult to interpret. It becomes challenging to determine whether observed effects are due to the parameter changes or the original experiment conditions.
- Replication issues — Future replication of the experiment becomes difficult if the parameters were not consistent throughout the original experiment.
Analysis-phase risks
1. Cherry-picking — selecting only favorable results
What it is
- Picking 3 favorable metrics out of 5, ignoring the 2 unfavorable ones
- Treating a significant effect found only in one segment as the overall conclusion
- Selecting "good" or "bad" numbers that are unrelated to the experiment hypothesis
- Slicing the data repeatedly until some cut shows a significant result
Why it's harmful
- Fabricated "success" stories: with 20 independent metrics and no real effect, you expect about 1 to come out significant at α = 0.05 by chance alone, and the probability of at least one is roughly 64%; reporting only that one is misleading
- Wrong product decisions — shipping a feature based on cherry-picked metrics may produce a net negative real-world effect
- Erodes data credibility — over time, stakeholders stop trusting experiment conclusions
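The arithmetic behind the multiple-metrics problem can be checked directly. This sketch assumes the metrics are independent and that the change has no real effect on any of them; the function names are illustrative.

```python
def prob_at_least_one_false_positive(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null metrics looks significant."""
    return 1 - (1 - alpha) ** n_metrics


def expected_false_positives(n_metrics: int, alpha: float = 0.05) -> float:
    """Average number of null metrics that look significant."""
    return n_metrics * alpha


p_any = prob_at_least_one_false_positive(20)
expected = expected_false_positives(20)
```

For 20 metrics at α = 0.05, `expected` is 1 and `p_any` is about 0.64, so finding one "winner" in a wide metric sweep is the likely outcome even when nothing changed.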
Solutions
- Report all pre-specified primary metric results, regardless of outcome
- Significant results should have a plausible causal explanation — statistically significant results that cannot be explained causally should be treated as likely false positives
- Consistency across multiple independent metrics increases credibility — if several related metrics all point in the same direction, even borderline p-values are more convincing
- Treat interesting segment findings discovered post-hoc as hypotheses for the next experiment, not conclusions from the current one
2. Ignoring novelty and learning effects
What it is
- Novelty effect — users engage more with a new feature in the short term because it is new, but behavior reverts to baseline over time
- Learning effect — a new feature requires time to adapt to; it performs poorly in the short term but may improve over the long term
- Drawing conclusions from early experiment metrics only, without observing trend changes over time
- Experiment duration is too short to distinguish short-term novelty from long-term impact
Why it's harmful
- False positive conclusion — novelty effect makes the new feature look good, but the effect fades after launch
- False negative conclusion — learning effect makes the new feature look bad initially, but it may benefit users long-term
- Reversed decisions — metrics drop after launch, forcing a rollback and wasting development resources
Solutions
- Run the experiment long enough — at least covering one full user-behavior cycle (typically ≥1 week)
- Look at day-by-day metric trends — if the effect decays over time → likely a novelty effect; if it grows over time → likely a learning effect
- Analyze differences between new and returning users: new users have no "prior experience" to compare against and are not subject to the novelty effect, so an effect that appears only among returning users points to novelty
- Use retention and revisit metrics to supplement the assessment of long-term impact
- For experiments at risk of novelty effects, consider staged rollouts with ongoing observation
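The day-by-day trend check can be sketched as a least-squares slope over the daily lift. The function names, the tolerance, and the example series below are made up for illustration; a slope cutoff like this is a rough screen, not a statistical test.

```python
def daily_lift_slope(daily_lifts: list[float]) -> float:
    """Least-squares slope of lift versus day index (day 0, 1, 2, ...)."""
    n = len(daily_lifts)
    if n < 2:
        raise ValueError("need at least two days of data")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_lifts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_lifts))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


def describe_trend(daily_lifts: list[float], tolerance: float = 1e-3) -> str:
    # tolerance is an arbitrary illustrative cutoff, not a significance threshold.
    slope = daily_lift_slope(daily_lifts)
    if slope < -tolerance:
        return "decaying: consistent with a novelty effect"
    if slope > tolerance:
        return "growing: consistent with a learning effect"
    return "stable"


# Made-up daily lifts (treatment minus control), for illustration only.
decaying = [0.08, 0.06, 0.05, 0.03, 0.02, 0.02, 0.01]
growing = [-0.02, -0.01, 0.00, 0.01, 0.02, 0.03, 0.03]
```

Running `describe_trend` on the two example series labels the first as decaying and the second as growing; either label is a prompt to extend observation rather than a conclusion in itself.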