Skip to content

Configure the Stats Engine and Validate Samples

This page answers one core question: how do you make the metric differences you see on the Results page trustworthy?

To make a number "trustworthy," two things must be true simultaneously:

  1. The statistical judgment is rigorous enough: the difference is not a product of random noise (control the false-positive rate)
  2. The samples themselves are comparable: the two variants being compared were balanced before the experiment began (control sampling bias)

ABC splits these two concerns into two tools: stats engine parameters handle item 1, and Experiment Backtrack validates item 2. Both are presented together on the Results tab; this page covers them end to end.

Scenario: A week later, the dashboard shows ad revenue at +1.2% — is this a real effect, or a weekend traffic-peak artifact? A P-value alone is not enough — you also need to confirm that the two variants' samples are comparable. This page walks through that complete judgment process.

Entry points

Stats engine parameters

Open the experiment's Results tab → click the Hypothesis testing parameters icon above the metrics table. A small panel opens with three settings (significance level, statistical power, FDR).

Hypothesis testing parameters panel

These settings affect the current view only — they do not change the experiment itself. Different people can view the same data with different thresholds without interfering with each other.

Backtrack tab

In the metadata bar shared by the Setup and Results tabs in the experiment detail page, there is an AB backtrack status label — this is the result of the platform's automated sample balance check. Click it to open the standalone backtrack page.

Control false positives: three core settings

1. Significance level

How strict the platform is when declaring a result "significant." Lower values are stricter — fewer false positives, but more data is needed to reach a conclusion.

OptionWhen to use
0.1%Almost no false positives; suitable for extremely high-traffic scenarios where near-certainty is required
1%High-cost decisions, such as repricing for paying players
5% (default)Standard choice; balances reliability and time-to-conclusion
10%Early exploration; useful for getting directional signals first

Game scenario: Testing a new diamond bundle price. Getting the price wrong means real money lost from paying players. Set to 1% — better to wait a few more days than to make a hasty decision.

2. Statistical power

The probability that the platform detects a real lift when one exists. Higher power means more sensitivity, but requires more data.

OptionWhen to use
60%Acceptable to miss small effects; only want to catch large changes
80% (default)Standard choice; detects most meaningful effects
85%Use when DAU is very high; can detect improvements as small as +0.5%

Game scenario: A hyper-casual game with 500K daily active users — sample size is not a bottleneck. Setting power to 85% allows you to detect even small retention improvements from onboarding optimizations.

3. FDR multiple comparison correction

When you examine many metrics at once, some may appear "significant" purely by chance. Enabling FDR correction prevents false positives.

OptionWhen to use
Off (default)Focused on 1–2 primary metrics; no correction needed
OnSimultaneously viewing 3 or more metrics; want to ensure that anything flagged as significant is genuinely real

Game scenario: Evaluating a new daily check-in reward by looking at D1 retention, D7 retention, DAU, ad impressions, and IAP revenue simultaneously — five metrics. The probability of at least one "significant" result by pure chance across five is non-trivial. After enabling correction, any result that remains significant can be trusted.

Reach conclusions faster: CUPED variance reduction

CUPED uses each player's pre-experiment behavioral data to filter out natural noise — allowing you to reach a conclusion with the same number of players in less time.

On the Results tab:

  • Each metric has a CUPED On / Off indicator
  • When at least one metric has CUPED enabled, a Variance Reduction: CUPED banner appears at the top
  • CUPED is on by default for all supported metrics; unsupported metric types show CUPED Off with a reason

Value for casual games: Player behavior is highly variable — some grind all weekend while others log in once a day. CUPED uses pre-experiment baselines to eliminate natural differences. Effect: an experiment may reach a conclusion in 5 days rather than 12.

See Variance Reduction for full details.

Sample comparability check: Experiment Backtrack

The previous three sections address the rigor of the statistical judgment. But rigorous judgment requires that "the samples are comparable in the first place" — if the Treatment and Control variants started with different user compositions (e.g., more heavy spenders ended up in Treatment), even a strict P-value is just putting a "significant" label on an illusion.

Experiment Backtrack is the backstop for this — it runs automatically on every experiment and answers one question:

Over the recent period, are the user compositions across variants truly comparable?

4 common usage scenarios

ScenarioBusiness contextWhat Backtrack tells you
Early-experiment assignment bias2 days after launch, Treatment_A retention is 5% above Control — looks like a winLabel Normal → the 5% lift is credible; non-Normal → may be a coincidental bias from the early assignment period
Drift during a long-running experimentExperiment runs 1 month; overall SRM passes, but Treatment unexpectedly reverses in the last weekSRM does not see recent drift; Backtrack covers only the last 3 / 7 days and can catch recent issues such as client bugs or uneven rollouts
Borderline result with uncertaintyP-value oscillates between 0.04 and 0.06Switch to the 7-Day window and check whether direction and magnitude are consistent with the 3-day view
Newly bound metric needs immediate verificationConcerned that the new and old metrics do not share the same sample baselineClick Rerun Backtrack to recompute immediately; results in a few minutes

Backtrack label states

Backtrack label states

DisplayMeaningWhat to do
- -No backtrack result available yet (new experiment or no exposures yet)Check whether a primary metric is bound and exposures have accumulated; if both are in place, wait a few hours
NormalThe most recent backtrack considers the variants to be balancedYou can trust the metrics table on the Results page

The AB backtrack field is underlined and clickable — click it to open the standalone backtrack page with detailed per-metric comparisons.

Reading all four signals together: the standard decision checklist

Whether an experiment result is "truly trustworthy" requires reading several signals on the Results page together:

SignalWhat it tells youWhat to do when it fails
AB backtrack label / pageWere the variants comparable over the last 3 / 7 days?Switch to the 7-Day view; do not rush to a decision when non-Normal
Sample Ratio Mismatch (SRM)Has the overall traffic split since the experiment started matched the configured allocation?SRM failed → check the allocation configuration and experiment environment
Stats engine parametersUnder the current thresholds, is the judgment of the difference rigorous enough?Tighten the significance level / enable FDR correction
SuggestionThe platform's synthesized recommendation for the next step (decide / wait / add metrics)Follow the guidance to fill in the missing data before re-evaluating

Decision path

Go through each item in order; all must pass before the result is "ready for a decision":

Pre-decision validation checklist

If any item fails, address it before making a decision.

When in doubt, use the defaults:

SettingDefaultWhen to adjust
Significance level5%Tighten to 1% for payment-related experiments; loosen to 10% for early exploration
Statistical power80%Raise to 85% when DAU is high to capture small lifts
FDR correctionOffTurn on when viewing 3 or more metrics simultaneously
CUPEDAuto-enabledAccept the platform's judgment by default; no manual adjustment needed