Statistical Methods
Confidence Intervals
Definition
A confidence interval quantifies the uncertainty around the observed metric change. A 95% confidence interval means: if you ran the experiment many times, 95% of the calculated intervals would contain the true effect.
Key interpretation: When the 95% confidence interval does not include zero, the result is statistically significant at α = 0.05.
Calculation
Absolute change (two-sided):
CI = ΔX̄ ± Z_{α/2} × SE(ΔX̄)
Where:
- ΔX̄ = observed absolute change (X̄_treatment − X̄_control)
- Z_{α/2} = 1.96 for a 95% confidence interval (two-sided)
- SE(ΔX̄) = standard deviation of the change = √Var(ΔX̄)
Relative change: Delta method
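As an illustration, a two-sided interval for the absolute change can be sketched in Python as the observed difference plus or minus a normal quantile times the standard error. The function name, its arguments, and the standard error formula (independent groups, per-user sample variances) are assumptions of this sketch rather than part of the original text.

```python
from math import sqrt
from statistics import NormalDist


def absolute_change_ci(mean_t, var_t, n_t, mean_c, var_c, n_c, confidence=0.95):
    """Two-sided confidence interval for the absolute change (treatment - control).

    Assumes independent groups and a normal approximation; var_t and var_c
    are per-user sample variances (an assumption of this sketch).
    """
    delta = mean_t - mean_c
    se = sqrt(var_t / n_t + var_c / n_c)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return delta - z * se, delta + z * se


# Hypothetical numbers, for illustration only.
low, high = absolute_change_ci(10.5, 4.0, 10_000, 10.4, 4.0, 10_000)
significant = not (low <= 0.0 <= high)  # interval excludes zero
```

With these made-up inputs the interval lies entirely above zero, so the change would be called significant at α = 0.05.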
P-values
Definition
A p-value is the probability of observing an effect at least as extreme as the measured metric change, assuming the null hypothesis is true (i.e., assuming there is no real difference between the two groups).
A p-value does not tell you the probability that the treatment works. It tells you how surprising the observed data would be if there were truly no effect.
How to interpret
| p-value | Interpretation |
|---|---|
| p < 0.01 | Strong evidence against H₀. Data this extreme would be very unlikely if there were no effect. |
| 0.01 ≤ p < 0.05 | Moderate evidence against H₀. Statistically significant at α = 0.05. |
| 0.05 ≤ p < 0.10 | Weak evidence. Not significant at α = 0.05, but directionally suggestive. |
| p ≥ 0.10 | Insufficient evidence to reject H₀. |
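A minimal sketch of how a two-sided p-value could be computed from an observed change and its standard error, assuming a normal approximation (a z-test). The function name and inputs are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(delta, se):
    """Two-sided p-value for an observed change under a normal approximation."""
    z = abs(delta) / se
    # Probability of a value at least this extreme in either tail under H0.
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical example: the observed change is two standard errors from zero.
p = two_sided_p_value(delta=0.1, se=0.05)
```

Here `p` comes out a little below 0.05, which falls in the "moderate evidence" row of the table above.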
Statistical Power
Definition
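Power (1 − β) is the probability that the experiment rejects H₀ when a true effect of the specified size exists, where β is the Type II error rate.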
| Term | Definition | Typical value |
|---|---|---|
| Power (1 − β) | P(reject H₀ \| H₁ is true) | 0.80 (80%) |
| Type II error (β) | P(fail to reject H₀ \| H₁ is true) | 0.20 (20%) |
An experiment with 80% power has a 20% chance of missing a true effect of the specified size.
Power formula
Power, MDE, sample size, and significance level are interrelated. Given any three, the fourth is determined:
Power = Φ(|Δ| / SE − Z_{α/2})
Where:
- |Δ| = true effect size (absolute value)
- SE = standard error of the change = √Var(ΔX̄)
- Z_{α/2} = significance threshold (1.96 at α = 0.05)
- Φ = standard normal cumulative distribution function
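For a two-sided test under a normal approximation, power is commonly written as Φ(|Δ| / SE − Z_{α/2}), which ignores the negligible chance of rejecting in the wrong direction. The sketch below evaluates that expression; the function name and example numbers are assumptions.

```python
from statistics import NormalDist


def power(delta, se, alpha=0.05):
    """Approximate power of a two-sided z-test: Phi(|delta| / se - z_{alpha/2})."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)  # about 1.96 at alpha = 0.05
    return nd.cdf(abs(delta) / se - z_crit)


# Hypothetical example: a true effect of 0.1 with a standard error of 0.0357.
achieved = power(delta=0.1, se=0.0357)  # roughly 0.80
```

Because SE shrinks as the sample grows, the same function shows how a larger N raises power for a fixed effect size.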
Factors affecting power
| Factor | Effect on power | Mitigation |
|---|---|---|
| ↑ Sample size (N) | ↑ Power | Run longer or allocate more traffic |
| ↑ Effect size (Δ) | ↑ Power | Expect a larger impact (or only accept detecting large effects) |
| ↓ Variance | ↑ Power | Apply CUPED, winsorization |
| ↑ Significance level (α) | ↑ Power | Accept a higher false positive rate (trade-off) |
| One-sided test | ↑ Power | Use only when direction is pre-specified and well justified |
MDE (Minimum Detectable Effect)
Definition
The MDE is the smallest true effect size that an experiment can reliably detect at a given power level. It answers the question: "How small a change can this experiment detect?"
- An MDE of 1% at 80% power means: if the true effect is exactly 1%, the experiment has an 80% chance of declaring it significant.
- True effects smaller than the MDE are unlikely to be detected (underpowered).
- MDE decreases as sample size increases — the longer you run the experiment, the smaller the effects you can detect.
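One common way to compute the MDE, assuming the normal approximation used for power, is to solve Power = 1 − β for the effect size, which gives MDE = (Z_{α/2} + Z_{β}) × SE with Z_{β} = Φ⁻¹(power). This formula is not stated in the text above; it and the helper names below are assumptions of the sketch.

```python
from math import sqrt
from statistics import NormalDist


def mde(se, alpha=0.05, power=0.80):
    """Smallest absolute effect detectable at the given power (normal approximation)."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) * se


def se_equal_groups(sigma, n_per_group):
    """Standard error of the difference for two equal-sized groups with a common sigma."""
    return sigma * sqrt(2 / n_per_group)


# Hypothetical example: quadrupling the sample size halves the MDE.
mde_small = mde(se_equal_groups(sigma=2.0, n_per_group=10_000))
mde_large = mde(se_equal_groups(sigma=2.0, n_per_group=40_000))
```

The multiplier is about 2.8 at α = 0.05 and 80% power, so the MDE is roughly 2.8 standard errors.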
How to use: interpreting non-significant results
When an experiment yields p ≥ α (not significant), the result is ambiguous: it could mean there is no true effect, or simply that the experiment lacked the power to detect it. The MDE resolves this ambiguity.
Decision logic:
Compare the MDE with the pre-specified target effect size (the minimum effect you care about):
| p-value | Post-hoc MDE vs. target | Interpretation | Action |
|---|---|---|---|
| p < α | — | Significant effect detected | Proceed to decision (ship / no ship) |
| p ≥ α | MDE ≤ target | Adequately powered, no effect detected | Conclude there is no effect of at least the target size; treat the treatment as ineffective. |
| p ≥ α | MDE > target | Underpowered | Inconclusive. Extend the experiment or increase sample size. |
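The decision table can be expressed as a small function. This is a sketch only; the function name and the returned labels are hypothetical.

```python
def interpret_result(p_value, mde, target, alpha=0.05):
    """Classify an experiment outcome following the decision table.

    mde and target must be in the same units (both absolute or both relative).
    """
    if p_value < alpha:
        return "significant"
    if mde <= target:
        return "adequately powered, no effect detected"
    return "underpowered, inconclusive"


# Hypothetical example: not significant, and the MDE (2%) exceeds the target (1%).
outcome = interpret_result(p_value=0.30, mde=0.02, target=0.01)
```

In this example the outcome is "underpowered, inconclusive", which corresponds to extending the experiment or increasing the sample size.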
Delta Method
Definition
The Delta method is used to compute the variance of ratio metrics — metrics defined as the ratio of two correlated random variables (e.g., clicks per session, revenue per purchase).
Why it is needed
The standard variance formula assumes the numerator and denominator are independent. For ratio metrics, however, both come from the same users, introducing a correlation that must be accounted for.
Example: "Clicks per session": clicks and sessions come from the same user pool, and users with more sessions naturally generate more clicks. Treating sessions as independent observations ignores this correlation and typically underestimates variance.
Calculation
For ratio R̄ = X̄ / Ȳ:
Var(R̄) ≈ (X̄/Ȳ)² × [Var(X̄)/X̄² + Var(Ȳ)/Ȳ² - 2·Cov(X̄,Ȳ)/(X̄·Ȳ)]
Expanding:
Var(R̄) ≈ (1/Ȳ²) × [Var(X̄) + R̄² · Var(Ȳ) - 2R̄ · Cov(X̄, Ȳ)]
Another use of the Delta method: relative lift
When computing confidence intervals for relative change (percentage change), the Delta method provides a heuristic approximation:
Var(Δ X̄ %) ≈ Var(Δ X̄) / X̄_control²
This approximation becomes more accurate as the sample size grows.
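Both Delta-method formulas in this section can be sketched from per-user data. The code assumes each user contributes one numerator value and one denominator value (for example, clicks and sessions per user) and that users are the independent units. Function names and sample data are hypothetical.

```python
from statistics import mean, variance


def ratio_variance(x, y):
    """Delta-method variance of R = mean(x) / mean(y) from per-user values."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    r = x_bar / y_bar
    # Variances and covariance of the sample means (sample statistic / n).
    var_x = variance(x) / n
    var_y = variance(y) / n
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1) / n
    return (var_x + r**2 * var_y - 2 * r * cov_xy) / y_bar**2


def relative_lift_variance(var_delta, control_mean):
    """Approximate variance of the relative change: Var(delta) / control_mean^2."""
    return var_delta / control_mean**2


# Hypothetical per-user data: clicks (x) and sessions (y) for four users.
clicks = [1, 2, 3, 4]
sessions = [1, 1, 2, 2]
var_ratio = ratio_variance(clicks, sessions)
```

When the numerator is exactly proportional to the denominator for every user, the covariance term cancels the other two and the estimated variance of the ratio is zero, which is a quick sanity check on the formula.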