Statistical Methods

Confidence Intervals

Definition

A confidence interval quantifies the uncertainty around the observed metric change. A 95% confidence interval means: if you ran the experiment many times, 95% of the calculated intervals would contain the true effect.

Key interpretation: When the 95% confidence interval does not include zero → the result is statistically significant at α = 0.05.

Calculation

Absolute change (two-sided):

CI = ΔX̄ ± z_{α/2} × σ_Δ

Where:

  • ΔX̄ = observed absolute change (X̄_treatment − X̄_control)
  • z_{α/2} = 1.96 for a 95% confidence interval (two-sided)
  • σ_Δ = standard deviation of the change = √(Var(X̄_treatment) + Var(X̄_control))
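
As a minimal sketch of the calculation above, the following Python function computes the two-sided interval for the absolute change. The function name and the example numbers are hypothetical, and the inputs are assumed to be the variances of the group means (sample variance divided by group size), not the per-user variances.

```python
from math import sqrt
from statistics import NormalDist


def absolute_change_ci(mean_t, var_mean_t, mean_c, var_mean_c, alpha=0.05):
    """Two-sided CI for the absolute change (treatment minus control).

    var_mean_t and var_mean_c are assumed to be variances of the group
    means, i.e. sample variance / group size.
    """
    delta = mean_t - mean_c
    sigma_delta = sqrt(var_mean_t + var_mean_c)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return delta - z * sigma_delta, delta + z * sigma_delta


# Hypothetical inputs, for illustration only.
low, high = absolute_change_ci(10.5, 0.04, 10.0, 0.04)
significant = not (low <= 0.0 <= high)  # interval excludes zero?
```

With these illustrative inputs the interval spans zero, so the observed change of 0.5 would not be called significant at α = 0.05.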

Relative change: use the Delta method (see the Delta Method section).

P-values

Definition

A p-value is the probability of observing an effect at least as extreme as the measured metric change, assuming the null hypothesis is true (i.e., assuming there is no real difference between the two groups).

A p-value does not tell you the probability that the treatment works. It tells you how surprising the observed data would be if there were truly no effect.
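
The page does not give a formula for the p-value, so the sketch below assumes a two-sided z-test under the same normal approximation used for the confidence interval. The function name and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(delta, se_delta):
    """Two-sided p-value for an observed change, assuming a z-test.

    delta is the observed absolute change and se_delta its standard error.
    """
    z = delta / se_delta
    # Probability of a statistic at least this extreme in either direction
    # if the true effect were zero.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical inputs, for illustration only.
p = two_sided_p_value(0.5, sqrt(0.08))
```

Under this assumption, a change that sits exactly 1.96 standard errors from zero gives a p-value of about 0.05, which matches the 95% confidence interval just touching zero.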

How to interpret

| p-value | Interpretation |
|---|---|
| p < 0.01 | Strong evidence against H₀. Data this extreme would be very unlikely if there were no true effect. |
| 0.01 ≤ p < 0.05 | Moderate evidence against H₀. Statistically significant at α = 0.05. |
| 0.05 ≤ p < 0.10 | Weak evidence. Not significant at α = 0.05, but directionally suggestive. |
| p ≥ 0.10 | Insufficient evidence to reject H₀. |

Statistical Power

Definition

Power (1 - β) is the probability of correctly detecting a true effect when one exists. Equivalently, β is the probability of a false negative (Type II error) — failing to detect a true effect.

| Term | Definition | Typical value |
|---|---|---|
| Power (1-β) | P(reject H₀ \| H₁ is true) | 0.80 (80%) |
| Type II error (β) | P(fail to reject H₀ \| H₁ is true) | 0.20 (20%) |

An experiment with 80% power has a 20% chance of missing a true effect of the specified size.

Power formula

Power, MDE, sample size, and significance level are interrelated. Given any three, the fourth is determined:

Power = Φ(|Δ| / SE(Δ) − Z_{1−α/2})

Where:

  • |Δ| = true effect size (absolute value)
  • SE(Δ) = standard error of the change = √(Var(X̄_treatment) + Var(X̄_control))
  • Z_{1−α/2} = significance threshold (1.96 at α = 0.05)
  • Φ = standard normal cumulative distribution function
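
A minimal sketch of the power formula in Python, using the standard library's normal distribution. The function name is hypothetical; `effect` stands for the true effect Δ and `se` for SE(Δ).

```python
from statistics import NormalDist


def power(effect, se, alpha=0.05):
    """Approximate power of a two-sided test for a given true effect."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at alpha = 0.05
    return NormalDist().cdf(abs(effect) / se - z_crit)


# A true effect of about 2.8 standard errors gives roughly 80% power.
example_power = power(2.8, 1.0)
```

Halving the standard error (for example by quadrupling the sample size) raises the power for the same effect, which is the relationship the formula encodes.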

Factors affecting power

| Factor | Effect on power | Mitigation |
|---|---|---|
| ↑ Sample size (N) | ↑ Power | Run longer or allocate more traffic |
| ↑ Effect size (Δ) | ↑ Power | Expect a larger impact (or only accept detecting large effects) |
| ↓ Variance | ↑ Power | Apply CUPED, winsorization |
| ↑ Significance level (α) | ↑ Power | Accept a higher false positive rate (trade-off) |
| One-sided test | ↑ Power | Use only when direction is pre-specified and well justified |

MDE (Minimum Detectable Effect)

Definition

The MDE is the smallest true effect size that an experiment can reliably detect at a given power level. It answers the question: "How small a change can this experiment detect?"

  • An MDE of 1% at 80% power means: if the true effect is exactly 1%, the experiment has an 80% chance of declaring it significant.
  • True effects smaller than the MDE are unlikely to be detected (underpowered).
  • MDE decreases as sample size increases — the longer you run the experiment, the smaller the effects you can detect.
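
Solving the power formula from the Statistical Power section for the effect size gives MDE = (Z_{1−α/2} + Φ⁻¹(power)) × SE(Δ). The sketch below implements that inversion. The function names are hypothetical, and `se_equal_groups` additionally assumes two equally sized groups that share a per-user variance.

```python
from math import sqrt
from statistics import NormalDist


def mde(se, alpha=0.05, power=0.80):
    """Smallest absolute effect detectable at the given power."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_crit + z_power) * se


def se_equal_groups(per_user_variance, n_per_group):
    """SE of the change, assuming equal group sizes and equal variances."""
    return sqrt(2 * per_user_variance / n_per_group)


# Quadrupling the sample size halves the MDE under these assumptions.
mde_small = mde(se_equal_groups(1.0, 1000))
mde_large = mde(se_equal_groups(1.0, 4000))
```

At α = 0.05 and 80% power the multiplier is about 2.8, so the MDE is roughly 2.8 standard errors of the change.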

How to use: interpreting non-significant results

When an experiment yields p ≥ α (not significant), the result is ambiguous: it could mean there is no true effect, or simply that the experiment lacked the power to detect it. The MDE resolves this ambiguity.

Decision logic:

Compare the MDE with the pre-specified target effect size (the minimum effect you care about):

| p-value | Post-hoc MDE vs. target | Interpretation | Action |
|---|---|---|---|
| p < α | n/a | Significant effect detected | Proceed to decision (ship / no ship) |
| p ≥ α | MDE ≤ target | Adequately powered, no effect | Accept the null. Treatment is ineffective. |
| p ≥ α | MDE > target | Underpowered | Inconclusive. Extend the experiment or increase sample size. |
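
The decision logic above can be sketched as a small function. The function name and the returned labels are hypothetical; `mde` and `target` are assumed to be expressed in the same units (both absolute or both relative).

```python
def interpret_result(p_value, mde, target, alpha=0.05):
    """Map a p-value and post-hoc MDE to one of the three outcomes."""
    if p_value < alpha:
        # Significant: the MDE comparison is not needed.
        return "significant: proceed to decision"
    if mde <= target:
        # Could have detected the target effect, but did not.
        return "adequately powered, no effect: accept the null"
    # Too little power to detect an effect of the target size.
    return "underpowered: extend the experiment"


# Hypothetical example: p = 0.30, MDE of 2% against a 1% target.
outcome = interpret_result(0.30, mde=0.02, target=0.01)
```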

Delta Method

Definition

The Delta method is used to compute the variance of ratio metrics — metrics defined as the ratio of two correlated random variables (e.g., clicks per session, revenue per purchase).

Why it is needed

The standard variance formula assumes the numerator and denominator are independent. For ratio metrics, however, both come from the same users, introducing a correlation that must be accounted for.

Example: "Clicks per session". Clicks and sessions come from the same user pool, and users with more sessions naturally generate more clicks. Ignoring this correlation biases the variance estimate; the covariance term in the formula below corrects for it.

Calculation

For ratio R̄ = X̄/Ȳ:

Var(R̄) ≈ (X̄/Ȳ)² × [Var(X̄)/X̄² + Var(Ȳ)/Ȳ² - 2·Cov(X̄,Ȳ)/(X̄·Ȳ)]

Expanding:

Var(R̄) ≈ (1/Ȳ²) × [Var(X̄) + R̄² · Var(Ȳ) - 2R̄ · Cov(X̄, Ȳ)]
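
A minimal sketch of the expanded formula, assuming one (numerator, denominator) pair per user, such as clicks and sessions. The function name and the sample data are hypothetical. The variances and covariance of the means are taken as the per-user sample values divided by the number of users.

```python
from statistics import mean, variance


def ratio_variance(x, y):
    """Delta-method variance of the ratio of means, mean(x) / mean(y).

    x and y hold per-user numerator and denominator values, in the same order.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    r = x_bar / y_bar
    var_x_bar = variance(x) / n
    var_y_bar = variance(y) / n
    # Sample covariance of the per-user values, then of the means.
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    cov_bar = cov_xy / n
    return (var_x_bar + r**2 * var_y_bar - 2 * r * cov_bar) / y_bar**2


# Hypothetical per-user data: clicks and sessions for four users.
clicks = [3, 5, 4, 8]
sessions = [1, 2, 3, 4]
var_ratio = ratio_variance(clicks, sessions)
```

When clicks are exactly proportional to sessions for every user, the three terms cancel and the estimated variance is zero, which shows how much the covariance term matters.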

Another use of the Delta method: relative lift

When computing confidence intervals for relative change (percentage change), the Delta method provides a heuristic approximation:

Var(Δ X̄ %) ≈ Var(Δ X̄) / X̄_control²

This approximation treats X̄_control as a fixed constant, which becomes more reasonable as the sample grows and X̄_control is estimated more precisely.
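
A minimal sketch of a confidence interval for relative change using the heuristic above. The function name and inputs are hypothetical, and the inputs are assumed to be the variances of the group means, as in the absolute-change calculation.

```python
from math import sqrt
from statistics import NormalDist


def relative_change_ci(mean_t, var_mean_t, mean_c, var_mean_c, alpha=0.05):
    """Two-sided CI for relative change, using Var(delta) / mean_c**2."""
    rel = (mean_t - mean_c) / mean_c
    var_delta = var_mean_t + var_mean_c
    se_rel = sqrt(var_delta / mean_c**2)  # heuristic: control mean held fixed
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return rel - z * se_rel, rel + z * se_rel


# Hypothetical inputs: a 5% observed lift.
rel_low, rel_high = relative_change_ci(10.5, 0.04, 10.0, 0.04)
```

Because the heuristic only rescales the absolute interval by the control mean, it reaches the same significance conclusion as the absolute-change interval.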