Risky Procedures in Experiments

This page covers common risky procedures in experiment design, execution, and analysis.

Design-phase risks

1. Starting an experiment without a clear hypothesis

What it is

Launching an experiment on intuition or a vague idea, such as "let's try a new button color" or "optimize the homepage"
Not pre-defining a specific, quantifiable hypothesis — for example, missing a clear statement like "Changing the CTA from gray to orange is expected to increase mobile click-through rate by ≥10%"

Why it's harmful

The data collected cannot answer any question — without a hypothesis there is no decision criterion; the result, however it turns out, yields no clear conclusion
Leads to "shotgun experiments" — choosing whichever metric looks favorable is equivalent to hunting for significant results in noise
Too many primary metrics — unclear hypotheses often lead to registering many "just-check" metrics; the more you check, the higher the false positive rate
Wastes traffic and time — occupies user traffic but produces no actionable conclusions

Typical scenario: A product manager says "I have a feeling this feature will do well" and starts an experiment. Two weeks later the data science team is asked "what's the conclusion?" — no one can answer, because no one defined what "well" means.

Solutions

Write down the hypothesis before the experiment, covering three elements:
1. Who (which users or population will be affected)
2. What (which metric)
3. How much (direction and magnitude)
Limit the number of primary metrics to those directly tied to the hypothesis
Establish an experiment review process — experiments with unclear hypotheses should not be launched
Document explicitly in the experiment brief: Hypothesis → primary metrics → guardrail metrics → decision rules

Good hypothesis example: "Changing the checkout-page CTA from 'Continue' to 'Buy Now' is expected to increase desktop checkout conversion by ≥5%, while cart abandonment rate does not increase by more than 1%."
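
The three hypothesis elements and the guardrail rule can be captured in a small structure that an experiment review could check before launch. This is a minimal sketch: the `ExperimentBrief` class, its field names, and the reading of "5%" and "1%" as relative changes are illustrative assumptions, not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentBrief:
    """Hypothetical experiment brief; every field name here is illustrative."""
    population: str           # who: which users are affected
    primary_metric: str       # what: the metric tied to the hypothesis
    min_relative_lift: float  # how much: expected direction and magnitude
    # guardrail metric -> maximum tolerated relative increase (assumed convention)
    guardrails: Dict[str, float] = field(default_factory=dict)

    def missing_elements(self) -> List[str]:
        """Return the hypothesis elements that are still undefined."""
        missing = []
        if not self.population.strip():
            missing.append("who")
        if not self.primary_metric.strip():
            missing.append("what")
        if self.min_relative_lift == 0:
            missing.append("how much")
        return missing


# The good hypothesis example above, expressed as a brief.
brief = ExperimentBrief(
    population="desktop users on the checkout page",
    primary_metric="checkout_conversion",
    min_relative_lift=0.05,
    guardrails={"cart_abandonment_rate": 0.01},
)

# A vague idea such as "optimize the homepage" leaves elements undefined.
vague = ExperimentBrief(population="", primary_metric="homepage", min_relative_lift=0.0)
print(brief.missing_elements(), vague.missing_elements())
```

A review process could refuse to launch any experiment whose brief reports missing elements.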

2. Wrong randomization unit

What it is

Mismatch between randomization unit and analysis unit (e.g., assigning by device but analyzing by user)

Why it's harmful

Device ID issues:
- One user accessing from multiple devices → assigned to different treatment groups → inconsistent experience
- Multiple people sharing a device → behavior incorrectly attributed to one individual → contaminated data
Unit mismatch: The analysis treats observations as independent when they are not, so standard errors are calculated incorrectly and p-values are unreliable

Solutions

User ID is the best default choice — persistent across devices, ensures consistency
Ensure randomization unit = analysis unit
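
One common way to keep assignment consistent across devices is to hash the user ID together with an experiment key. The sketch below assumes this approach; `assign_group`, the key format, and the choice of SHA-256 are illustrative and do not describe how any particular platform assigns users.

```python
import hashlib


def assign_group(user_id: str, experiment_key: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to a group by hashing user ID plus experiment key."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # The first 8 hex characters give a 32-bit integer; scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user gets the same group on every device and every call.
print(assign_group("user-42", "checkout-cta") == assign_group("user-42", "checkout-cta"))
```

Because the group depends only on the user ID and the experiment key, a user who switches devices stays in the same group, and the analysis can aggregate by the same user ID.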

3. Ignoring guardrail metrics and side effects

What it is

Focusing only on the target metric (e.g., click-through rate) while ignoring other key business metrics
Not setting guardrail metrics — such as crash rate, page load time, revenue
Not considering the indirect impact of feature changes on other product areas

Why it's harmful

Local optimization, global degradation — click-through rate improves but revenue drops (e.g., from misleading copy)
User retention suffers — short-term metrics look good but long-term retention declines
Technical risks go unmonitored — a new feature causes crash rate to spike with no one watching

Solutions

Every experiment must include:
1. Target metrics (what you want to improve)
2. Guardrail metrics (what must not get worse, e.g., crash rate, revenue, retention)
3. Exploratory metrics (metrics that might be affected, for discovering unexpected effects)
Pre-define decision rules in the experiment design: specify how much degradation in each guardrail metric requires the experiment to be stopped
Actively check guardrail metrics after launch, not just target metrics
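
A pre-defined stop rule can be written down as thresholds and checked mechanically. The sketch below is illustrative only: the function name, metric names, and thresholds are assumptions, every guardrail is assumed to be "higher is worse", and the comparison ignores statistical noise, which a real check would need to account for.

```python
from typing import Dict, List


def guardrail_breaches(
    control: Dict[str, float],
    treatment: Dict[str, float],
    max_relative_increase: Dict[str, float],
) -> List[str]:
    """Return guardrail metrics whose relative increase exceeds its threshold.

    Assumes higher values are worse for every listed metric.
    """
    breaches = []
    for metric, limit in max_relative_increase.items():
        base = control[metric]
        relative_change = (treatment[metric] - base) / base
        if relative_change > limit:
            breaches.append(metric)
    return breaches


# Hypothetical readings: crash rate rose about 30%, load time rose under 1%.
control = {"crash_rate": 0.010, "p95_load_ms": 1200.0}
treatment = {"crash_rate": 0.013, "p95_load_ms": 1210.0}
limits = {"crash_rate": 0.10, "p95_load_ms": 0.05}
print(guardrail_breaches(control, treatment, limits))
```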

Execution-phase risks

1. Peeking — making decisions before the target sample size is reached

What it is

The experiment is scheduled for two weeks, but you stop it after three days because the results look "significant"
Continuously monitoring the experiment dashboard and declaring victory as soon as the p-value drops below 0.05
Extending when results look bad, stopping early when results look good — in essence, selectively harvesting significant results
Looking at results multiple times but only reporting the "best" one

Why it's harmful

False positive rate inflates significantly — every peek is an implicit hypothesis test; multiple peeks push the actual α well above 0.05
Random fluctuation in p-values — during an experiment, p-values can oscillate between significant and non-significant; any single snapshot does not represent the final result

Key insight: The problem with peeking is "optional stopping" — you are not choosing a random point in time to read results; you are stopping because the results look significant. This is no different in principle from flipping a coin repeatedly and only reporting the round where all flips came up heads.
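
The inflation can be seen in a small simulation of A/A experiments, where no true effect exists. This is a simplified sketch under stated assumptions: each look adds one standardized, normally distributed increment of evidence, the number of looks (14) and the seed are arbitrary, and `false_positive_rates` is a name made up for this example.

```python
import random
from typing import Tuple


def false_positive_rates(
    n_experiments: int = 2000,
    n_looks: int = 14,
    z_crit: float = 1.96,
    seed: int = 0,
) -> Tuple[float, float]:
    """Simulate A/A experiments and compare two stopping rules.

    The z-statistic after k looks is the running sum of k standard normal
    increments divided by sqrt(k), so it is standard normal at every look.
    Returns (rate when stopping at any significant look, rate at final look only).
    """
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and declared a win here
        any_look_hits += crossed
        final_look_hits += abs(z) > z_crit
    return any_look_hits / n_experiments, final_look_hits / n_experiments


peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at any significant look: {peeking_rate:.3f}")
print(f"single final analysis:        {fixed_rate:.3f}")
```

The single final analysis stays near the nominal 5%, while stopping at the first significant look yields a clearly higher false positive rate.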

Solutions

Option 1: Adhere strictly to a fixed-sample design
- Pre-calculate required sample size and run duration
- Do not look at results before the target is reached (or if you do, make no decisions based on them)
- Conduct one final analysis once the target is reached
Option 2: Establish strict monitoring rules
- For day-to-day monitoring, watch only guardrail metrics (to catch serious problems) — do not check the significance of primary metrics
- Check the significance of primary metrics only at pre-specified analysis time points

2. SRM (Sample Ratio Mismatch) goes undetected

What it is

The experiment is configured as 50/50, but actual data shows a 60/40 or similar imbalance
No SRM check is run, or an SRM alert fires but is ignored

Why it's harmful

All analytical conclusions are untrustworthy — SRM means randomization has been broken; the treatment group and control group are no longer comparable
Direction of bias is unpredictable — the true effect may be over- or under-estimated
Easy to miss: the data looks "normal" and metrics show significant differences, but the root cause is data collection bias, not a product effect

Solutions

Every experiment must pass an SRM check before its results are analyzed or the feature is shipped (see Sample Ratio Mismatch under the statistics engine)
When SRM is detected, fix the root cause and restart the experiment — using contaminated data for decisions is not recommended
Common fixes: fix crash bugs, adjust the exposure trigger point, correct assignment logic
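
An SRM check is commonly done with a chi-square goodness-of-fit test on the group counts. The sketch below is a generic version of that test for two groups, not the statistics engine's implementation; `srm_p_value` is a hypothetical name, and the 0.001 alert threshold in the usage lines is a commonly used convention assumed here.

```python
import math


def srm_p_value(control_count: int, treatment_count: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-group split."""
    total = control_count + treatment_count
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_count - expected_treatment) ** 2 / expected_treatment
            + (control_count - expected_control) ** 2 / expected_control)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# A configured 50/50 split that arrives as 60/40 is flagged; 5050/4950 is not.
for control_n, treatment_n in [(6000, 4000), (5050, 4950)]:
    p = srm_p_value(control_n, treatment_n)
    print(control_n, treatment_n, "SRM suspected" if p < 0.001 else "no SRM detected")
```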

3. Code bugs during execution introduce confounding variables

What it is

The experiment feature code has a bug, causing the treatment group user experience to differ from what was intended
The experiment code itself introduces performance issues (slower loading, rendering anomalies)
The experiment assignment logic is wrong; some users are not actually placed in the correct group
Instrumentation code errors cause events to be dropped or double-counted

Why it's harmful

You are testing something other than what you intended — you think you are measuring the new feature's effect; you are actually measuring the effect of the bug
Performance issues mask the real effect — the feature itself may be good, but 2 extra seconds of loading time degrades the entire user experience
Assignment errors contaminate the groups and can cause SRM: some "treatment group" users actually see the control experience

Solutions

Test thoroughly before launch, simulating real user flows
Use automated integration tests to validate experiment functionality
In the early launch phase, start with a small traffic allocation (e.g., 1%) and confirm there are no critical bugs before ramping up
Monitor technical metrics such as crash rate, load time, and error rate in real time

4. Incorrect exposure trigger point

What it is

The treatment group triggers exposure deep in the page; the control group triggers it at the top
Exposure timing is misaligned with the user's actual experience
The exposure event is reported before the user has actually seen the experiment feature
Under certain conditions the exposure event fails silently (e.g., on poor network connections)

Why it's harmful

Introduces selection bias — the control group includes all visitors; the treatment group includes only users who "successfully reached" the feature; the two groups are no longer comparable
Metric calculation errors — inconsistent denominators make conversion rates and other metrics incomparable
Causes SRM — the two groups report inconsistent exposure counts

Typical scenario: The new feature in the treatment group is at the bottom of the page; users must scroll to see it. Exposure fires when the feature renders, while the control group's exposure fires when the page loads. Result: the control group's "exposed users" include everyone; the treatment group includes only users who scrolled to the bottom — who are already more engaged users.

Solutions

Align the exposure trigger timing — both groups' exposures should fire at the same user-behavior event
If the experiment feature requires user action to activate, the control group should also have a corresponding "virtual exposure" event
Ensure exposure events are reliably reported; add failure-retry logic
Regularly compare observed exposure counts against expected allocations to catch losses early

5. Reducing traffic during the experiment

Reducing traffic during an experiment introduces several risks that can compromise the validity and reliability of its results.

Reduced statistical power: Lower traffic means smaller samples, which reduces statistical power. Real differences between the control and treatment groups become harder to detect, increasing the likelihood of Type II errors (false negatives).
Effect dilution: ABC accumulates user experiment data from the time of the user's first exposure. After traffic is reduced, some users are no longer exposed to the new product features but are still counted in the experiment group, which dilutes the measured effect.
Bias introduction: If the reduction is not applied uniformly across all groups, it can introduce selection bias, producing an unrepresentative sample that skews the results.
Inconsistent user experience: Users who have already been exposed to the new product features may lose them when traffic is reduced. This degraded experience can change their behavior and add variability to the results.
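
The loss of statistical power can be estimated before deciding to cut traffic. The sketch below uses the normal approximation for a two-sided two-proportion z-test; the conversion rates and sample sizes are hypothetical, and `approx_power` is a name chosen for this example.

```python
from statistics import NormalDist


def approx_power(p_control: float, p_treatment: float, n_per_group: int,
                 alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test.

    Uses the normal approximation and ignores the negligible chance of
    rejecting in the wrong direction.
    """
    std_normal = NormalDist()
    standard_error = (p_control * (1 - p_control) / n_per_group
                      + p_treatment * (1 - p_treatment) / n_per_group) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(abs(p_treatment - p_control) / standard_error - z_crit)


# Hypothetical plan: 10% vs 11% conversion, then traffic cut to one third.
planned = approx_power(0.10, 0.11, n_per_group=15000)
reduced = approx_power(0.10, 0.11, n_per_group=5000)
print(f"planned power: {planned:.2f}, after traffic cut: {reduced:.2f}")
```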

6. Changing parameters during the experiment

Changing experiment parameters during an online controlled experiment introduces several risks that can compromise its integrity and validity.

Bias introduction: Users exposed to different sets of parameters may notice the feature changes, which can influence their behavior and introduce bias of unknown direction and size.
Inconsistent user experience: Changing parameters can give users an inconsistent experience, which affects their behavior and adds variability to the results.
Interpretation challenges: Results become ambiguous, because it is hard to tell whether an observed effect comes from the original parameters or from the changed ones.
Replication issues: The experiment is difficult to replicate later if its parameters were not consistent throughout the original run.

Analysis-phase risks

1. Cherry-picking — selecting only favorable results

What it is

Picking 3 favorable metrics out of 5, ignoring the 2 unfavorable ones
Treating a significant effect found only in one segment as the overall conclusion
Selecting "good" or "bad" numbers that are unrelated to the experiment hypothesis
Slicing the data repeatedly until some cut shows a significant result

Why it's harmful

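Fabricated "success" stories: with 20 independent metrics and no true effect, each has a 5% chance of appearing significant at α = 0.05, so about 1 false positive is expected and the chance of at least one is roughly 64%; reporting only that one is misleading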
Wrong product decisions — shipping a feature based on cherry-picked metrics may produce a net negative real-world effect
Erodes data credibility — over time, stakeholders stop trusting experiment conclusions
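
The arithmetic behind the multiple-metrics problem is short enough to compute directly. The sketch below assumes the metrics are independent and that no true effect exists. The Bonferroni correction it includes is one standard way to compensate, offered here as an illustration rather than as the method this document prescribes.

```python
def prob_at_least_one_false_positive(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null metrics looks significant."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(n_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric threshold that keeps the overall false positive rate at or below alpha."""
    return alpha / n_metrics


for m in (1, 5, 20):
    print(m, round(prob_at_least_one_false_positive(m), 3), bonferroni_alpha(m))
```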

Solutions

Report all pre-specified primary metric results, regardless of outcome
Significant results should have a plausible causal explanation — statistically significant results that cannot be explained causally should be treated as likely false positives
Consistency across multiple independent metrics increases credibility — if several related metrics all point in the same direction, even borderline p-values are more convincing
Treat interesting segment findings discovered post-hoc as hypotheses for the next experiment, not conclusions from the current one

2. Ignoring novelty and learning effects

What it is

Novelty effect — users engage more with a new feature in the short term because it is new, but behavior reverts to baseline over time
Learning effect — a new feature requires time to adapt to; it performs poorly in the short term but may improve over the long term
Drawing conclusions from early experiment metrics only, without observing trend changes over time
Experiment duration is too short to distinguish short-term novelty from long-term impact

Why it's harmful

False positive conclusion — novelty effect makes the new feature look good, but the effect fades after launch
False negative conclusion — learning effect makes the new feature look bad initially, but it may benefit users long-term
Reversed decisions — metrics drop after launch, forcing a rollback and wasting development resources

Solutions

Run the experiment long enough — at least covering one full user-behavior cycle (typically ≥1 week)
Look at day-by-day metric trends — if the effect decays over time → likely a novelty effect; if it grows over time → likely a learning effect
Analyze differences between new and returning users: new users have no "prior experience" to compare against and are not subject to the novelty effect, so an effect that appears only among returning users points to novelty
Use retention and revisit metrics to supplement the assessment of long-term impact
For experiments at risk of novelty effects, consider staged rollouts with ongoing observation
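
The day-by-day trend check can be made concrete by fitting a line to the daily lift. This is a rough sketch with made-up lift values; `daily_lift_slope` is a hypothetical helper, and a real analysis would also consider the uncertainty of each day's estimate before reading anything into the slope.

```python
from typing import Sequence


def daily_lift_slope(daily_lifts: Sequence[float]) -> float:
    """Least-squares slope of lift versus day index (0, 1, 2, ...); needs at least 2 days."""
    n = len(daily_lifts)
    mean_day = (n - 1) / 2
    mean_lift = sum(daily_lifts) / n
    numerator = sum((day - mean_day) * (lift - mean_lift)
                    for day, lift in enumerate(daily_lifts))
    denominator = sum((day - mean_day) ** 2 for day in range(n))
    return numerator / denominator


# Hypothetical daily relative lifts of treatment over control.
decaying = [0.10, 0.08, 0.06, 0.04, 0.02]  # shrinking lift: possible novelty effect
growing = [0.02, 0.04, 0.06, 0.08, 0.10]   # growing lift: possible learning effect
print(daily_lift_slope(decaying), daily_lift_slope(growing))
```

A clearly negative slope is consistent with a novelty effect and a clearly positive one with a learning effect; a slope near zero suggests the effect is stable over the observed period.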
