Statistical Methods

Confidence Intervals

Definition

A confidence interval quantifies the uncertainty around the observed metric change. A 95% confidence interval means: if you ran the experiment many times, 95% of the calculated intervals would contain the true effect.

Key interpretation: When the 95% confidence interval does not include zero, the result is statistically significant at α = 0.05.

Calculation

Absolute change (two-sided):

$$CI = Δ \bar{X} \pm z_{α / 2} \times σ_{Δ}$$

Where:

$Δ \bar{X}$ = observed absolute change ( $\bar{X}_{treatment} - \bar{X}_{control}$ )
$z_{α / 2}$ = 1.96 for a 95% confidence interval (two-sided)
$σ_{Δ}$ = standard error of the change (the standard deviation of its sampling distribution, assuming independent groups) = $\sqrt{Var(\bar{X}_{treatment}) + Var(\bar{X}_{control})}$
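
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `absolute_change_ci` and the toy data are made up for the example, and it assumes independent groups and a normal approximation (which in practice needs far larger samples than the five values used here).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def absolute_change_ci(treatment, control, alpha=0.05):
    """Two-sided confidence interval for mean(treatment) - mean(control)."""
    delta = mean(treatment) - mean(control)
    # Variance of a sample mean = sample variance / sample size
    var_t = variance(treatment) / len(treatment)
    var_c = variance(control) / len(control)
    se = sqrt(var_t + var_c)  # sigma_delta in the formula above
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return delta - z * se, delta + z * se


# Toy per-user metric values (hypothetical data)
low, high = absolute_change_ci(
    [12.0, 15.0, 11.0, 14.0, 13.0],
    [10.0, 11.0, 9.0, 12.0, 10.0],
)
print(f"95% CI for the absolute change: [{low:.2f}, {high:.2f}]")
```

If the printed interval excludes zero, the change is significant at the chosen α.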

Relative change: use the Delta method (see the Delta Method section).

P-values

Definition

A p-value is the probability of observing an effect at least as extreme as the measured metric change, assuming the null hypothesis is true (i.e., assuming there is no real difference between the two groups).

A p-value does not tell you the probability that the treatment works. It tells you how surprising the observed data would be if there were truly no effect.
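
As a rough illustration of the definition, a two-sided p-value can be computed from the observed change and its standard error under a normal approximation. The function name `two_sided_p_value` and the input values are assumptions made for this sketch.

```python
from statistics import NormalDist


def two_sided_p_value(delta, se):
    """P(|Z| >= |delta / se|) when the null hypothesis (no effect) is true."""
    z = abs(delta) / se
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical observed change of 2.0 with a standard error of 1.0
p = two_sided_p_value(delta=2.0, se=1.0)
print(f"p-value: {p:.4f}")
```

An observed change of zero gives a p-value of 1, and the p-value shrinks as the change grows relative to its standard error.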

How to interpret

p-value	Interpretation
p < 0.01	Strong evidence against H₀. Data this extreme would be very unlikely if there were no effect.
0.01 ≤ p < 0.05	Moderate evidence against H₀. Statistically significant at α = 0.05.
0.05 ≤ p < 0.10	Weak evidence. Not significant at α = 0.05, but directionally suggestive.
p ≥ 0.10	Insufficient evidence to reject H₀.

Statistical Power

Definition

Power (1 - $β$ ) is the probability of correctly detecting a true effect when one exists. Equivalently, $β$ is the probability of a false negative (Type II error) — failing to detect a true effect.

Term	Definition	Typical value
Power (1 - $β$)	P(reject H₀ | H₁ is true)	0.80 (80%)
Type II error ($β$)	P(fail to reject H₀ | H₁ is true)	0.20 (20%)

An experiment with 80% power has a 20% chance of missing a true effect of the specified size.

Power formula

Power, MDE, sample size, and significance level are interrelated. Given any three, the fourth is determined:

$$Power = Φ\left(\frac{|Δ|}{SE(Δ)} - Z_{1 - α / 2}\right)$$

Where:

$| Δ |$ = true effect size (absolute value)
$SE(Δ)$ = standard error of the change = $\sqrt{Var(\bar{X}_{treatment}) + Var(\bar{X}_{control})}$
$Z_{1 - α / 2}$ = significance threshold (1.96 at α = 0.05)
$Φ$ = standard normal cumulative distribution function
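
The power formula translates directly into code. The sketch below uses the hypothetical name `approx_power`. Like the formula, it ignores the negligible probability of rejecting in the wrong direction.

```python
from statistics import NormalDist


def approx_power(delta, se, alpha=0.05):
    """Approximate power of a two-sided z-test for a true effect `delta`."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at alpha = 0.05
    return NormalDist().cdf(abs(delta) / se - z_crit)


# Power rises as the true effect grows relative to the standard error
for effect in (1.0, 2.0, 2.8, 4.0):
    print(f"|delta| / SE = {effect:.1f}: power = {approx_power(effect, 1.0):.2f}")
```

An effect of roughly 2.8 standard errors corresponds to the conventional 80% power at α = 0.05.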

Factors affecting power

Factor	Effect on power	Mitigation
↑ Sample size (N)	↑ Power	Run longer or allocate more traffic
↑ Effect size (Δ)	↑ Power	Test a bolder change, or accept that only large effects will be detected
↓ Variance	↑ Power	Apply CUPED, winsorization
↑ Significance level (α)	↑ Power	Accept a higher false positive rate (trade-off)
One-sided test	↑ Power	Use only when direction is pre-specified and well justified

MDE (Minimum Detectable Effect)

Definition

The MDE is the smallest true effect size that an experiment can reliably detect at a given power level. It answers the question: "How small a change can this experiment detect?"

An MDE of 1% at 80% power means: if the true effect is exactly 1%, the experiment has an 80% chance of declaring it significant.
True effects smaller than the MDE are unlikely to be detected (underpowered).
MDE decreases as sample size increases — the longer you run the experiment, the smaller the effects you can detect.
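
Solving the power formula from the Statistical Power section for the effect size gives MDE = (Z_{1 - α/2} + Z_{power}) × SE(Δ). The sketch below applies this. The helper `se_equal_groups` is an illustrative assumption: two groups of equal size with the same per-user standard deviation `sigma`.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(se, alpha=0.05, power=0.80):
    """Smallest absolute effect detectable at the given power (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * se


def se_equal_groups(sigma, n_per_group):
    """SE of a difference in means, assuming equal group sizes and variances."""
    return sigma * sqrt(2 / n_per_group)


# Quadrupling the sample size halves the MDE
for n in (1_000, 4_000, 16_000):
    mde = minimum_detectable_effect(se_equal_groups(sigma=1.0, n_per_group=n))
    print(f"n per group = {n:>6}: MDE = {mde:.4f}")
```

Because the standard error shrinks with the square root of the sample size, halving the MDE requires roughly four times as many users.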

How to use: interpreting non-significant results

When an experiment yields p ≥ α (not significant), the result is ambiguous: it could mean there is no true effect, or simply that the experiment lacked the power to detect it. The MDE resolves this ambiguity.

Decision logic:

Compare the MDE with the pre-specified target effect size (the minimum effect you care about):

p-value	Post-hoc MDE vs. target	Interpretation	Action
p < α	—	Significant effect detected	Proceed to decision (ship / no ship)
p ≥ α	MDE ≤ target	Adequately powered, no effect detected	An effect as large as the target is unlikely. Treat the treatment as not meaningfully effective.
p ≥ α	MDE > target	Underpowered	Inconclusive. Extend the experiment or increase sample size.
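
The decision table can be expressed as a small helper. The function name `interpret_result` and the returned labels are hypothetical and only mirror the rows above.

```python
def interpret_result(p_value, mde, target, alpha=0.05):
    """Map (p-value, post-hoc MDE, target effect size) to a table row above."""
    if p_value < alpha:
        return "significant: proceed to decision"
    if mde <= target:
        return "adequately powered: no effect detected"
    return "underpowered: inconclusive, extend the experiment"


# Hypothetical experiments with a target effect of 1%
print(interpret_result(p_value=0.03, mde=0.008, target=0.01))
print(interpret_result(p_value=0.40, mde=0.008, target=0.01))
print(interpret_result(p_value=0.40, mde=0.025, target=0.01))
```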

Delta Method

Definition

The Delta method is used to compute the variance of ratio metrics — metrics defined as the ratio of two correlated random variables (e.g., clicks per session, revenue per purchase).

Why it is needed

The standard variance formula assumes the numerator and denominator are independent. For ratio metrics, however, both come from the same users, introducing a correlation that must be accounted for.

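Example: "Clicks per session" — clicks and sessions come from the same user pool. Users with more sessions naturally generate more clicks. Ignoring this correlation gives an incorrect variance estimate; in particular, treating each session as an independent observation typically underestimates the variance.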

Calculation

For ratio $\bar{R} = \bar{X} / \bar{Y}$ :

Var(R̄) ≈ (X̄/Ȳ)² × [Var(X̄)/X̄² + Var(Ȳ)/Ȳ² - 2·Cov(X̄,Ȳ)/(X̄·Ȳ)]

Expanding:

Var(R̄) ≈ (1/Ȳ²) × [Var(X̄) + R̄² · Var(Ȳ) - 2R̄ · Cov(X̄, Ȳ)]
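
A minimal sketch of the expanded formula, assuming one (numerator, denominator) pair per user, for example clicks and sessions. The names `sample_covariance` and `ratio_variance` and the data are made up for illustration.

```python
from statistics import mean, variance


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def ratio_variance(xs, ys):
    """Delta-method variance of R = mean(xs) / mean(ys), one pair per user."""
    n = len(xs)
    y_bar = mean(ys)
    r = mean(xs) / y_bar
    var_x = variance(xs) / n  # Var of the numerator mean
    var_y = variance(ys) / n  # Var of the denominator mean
    cov_xy = sample_covariance(xs, ys) / n  # Cov of the two means
    return (var_x + r**2 * var_y - 2 * r * cov_xy) / y_bar**2


# Hypothetical per-user clicks and sessions
clicks = [3, 5, 2, 8, 4]
sessions = [1, 2, 1, 3, 2]
print(f"Var(clicks per session) ~ {ratio_variance(clicks, sessions):.5f}")
```

A useful sanity check: when the numerator is exactly proportional to the denominator, the ratio is constant and the estimated variance is zero.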

Another use of the Delta method: relative lift

When computing confidence intervals for relative change (percentage change), the Delta method provides a heuristic approximation:

Var(Δ X̄ %) ≈ Var(Δ X̄) / X̄_control²

This approximation becomes more accurate as the sample size grows, because the control mean in the denominator is then estimated with less noise.
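
The approximation can be turned into a confidence interval for the relative change as follows. This sketch treats the control mean as a fixed constant, as the formula above does. The function name `relative_change_ci` and the input values are assumptions for the example.

```python
from math import sqrt
from statistics import NormalDist


def relative_change_ci(delta, var_delta, control_mean, alpha=0.05):
    """Two-sided CI for delta / control_mean using Var(delta) / control_mean**2."""
    rel = delta / control_mean
    se_rel = sqrt(var_delta) / abs(control_mean)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return rel - z * se_rel, rel + z * se_rel


# Hypothetical: absolute change of 2.0 (variance 1.0) on a control mean of 100
low, high = relative_change_ci(delta=2.0, var_delta=1.0, control_mean=100.0)
print(f"95% CI for the relative change: [{low:.2%}, {high:.2%}]")
```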
