Chapter 01.3 introduced hypothesis tests as tools for measuring whether an observed difference is real. This chapter is about designing the experiment that produces the data those tests will evaluate.
Design matters at least as much as analysis. A well-designed experiment with a modest sample is more informative than a poorly designed one with ten times the data. The mistakes that make experiments inconclusive — wrong randomization unit, insufficient power, early stopping, confounded groups — are all committed before a single observation is recorded.
We cover the anatomy of a controlled experiment, randomization strategies, variance reduction before launch, adaptive testing with bandits, multi-variate designs, and the practical question of which experiments are actually worth running.
6.1 The Anatomy of an Experiment
Every controlled experiment has four components we must define before collecting data.
The unit of randomization is the entity that gets assigned to a condition — a user, a session, a device, or a geographic region. The choice matters: if we randomize by session, the same user can see both variants, which contaminates the comparison.
The metric is the quantity we are trying to move. There is usually one primary metric (the decision criterion) and a set of guardrail metrics we promise not to harm. Defining these before the experiment prevents the temptation to pick the metric that happened to move.
The minimum detectable effect (MDE) is the smallest change we would actually act on. This drives the sample size calculation. Setting the MDE too small produces impractically large sample requirements; setting it too large misses real but modest improvements.
The assignment mechanism is how units are allocated to treatment and control. Random assignment is the defining feature of a controlled experiment; it eliminates confounding on both observed and unobserved variables in expectation.
6.2 Randomization Strategies
The most common assignment mechanism is simple random assignment: a hash of the unit ID (modulo bucket count) determines the group. Using a hash rather than a random number generator ensures deterministic, repeatable assignment — the same user always lands in the same bucket, regardless of when they visit.
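A minimal sketch of this hash-based bucketing, assuming MD5 and a hypothetical experiment-name salt (production systems layer on holdouts and mutual exclusion, but the core idea is this small):

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    # Salt the hash with the experiment name so different experiments
    # get independent, effectively orthogonal splits of the same users.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def assign_group(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    # Deterministic: the same user always lands in the same bucket,
    # and therefore the same group, no matter when they visit.
    bucket = assign_bucket(user_id, experiment)
    return "treatment" if bucket < treatment_pct else "control"

print(assign_group("user-123", "new_checkout"))
```

Because assignment is a pure function of the ID, no assignment table needs to be stored or looked up at serving time.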
Stratified assignment improves balance on key covariates. We partition users into strata (e.g., new vs. returning, mobile vs. desktop) and assign independently within each stratum. This is especially valuable for small experiments where simple randomization could, by chance, produce imbalanced groups.
Cluster randomization is necessary when units interact with each other. In a marketplace, showing different prices to buyers and sellers in the same market violates the Stable Unit Treatment Value Assumption (SUTVA): one unit’s outcome depends on another’s assignment. We instead randomize at the market or geographic level, accepting smaller effective sample sizes in exchange for cleaner causal identification.
Switchback designs (time-based randomization) alternate between conditions over time periods and are used when geo-level randomization is infeasible. They require careful autocorrelation correction in the analysis.
Code

```python
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
np.random.seed(42)

# ---- Stratified random assignment ----
n = 1000
users = pd.DataFrame({
    "user_id": range(n),
    "platform": np.random.choice(["mobile", "desktop"], n, p=[0.65, 0.35]),
    "tenure": np.random.choice(["new", "returning"], n, p=[0.40, 0.60]),
})

# Assign within each stratum so balance is guaranteed
def stratified_assign(df, strata_cols, seed=0):
    rng = np.random.default_rng(seed)
    assignment = pd.Series("", index=df.index)
    for _, grp in df.groupby(strata_cols):
        idx = grp.index.tolist()
        rng.shuffle(idx)
        half = len(idx) // 2
        assignment[idx[:half]] = "control"
        assignment[idx[half:]] = "treatment"
    return assignment

users["group"] = stratified_assign(users, ["platform", "tenure"])

print("Assignment balance by stratum:")
print(users.groupby(["platform", "tenure", "group"]).size().unstack())
print()
print("Overall split:", users["group"].value_counts().to_dict())
```
6.3 CUPED: Variance Reduction Without More Data
Larger samples increase power, but we can also increase power by reducing the variance of the outcome metric — without collecting more data. CUPED (Controlled-experiment Using Pre-Experiment Data) does this by regressing out variation that is predictable from a pre-experiment covariate.
If \(Y\) is the outcome metric and \(X\) is a pre-period observation of the same metric (e.g., purchase amount in the two weeks before the experiment), the adjusted outcome is:

\[
Y_{\text{cuped}} = Y - \theta\,(X - \bar{X}), \qquad \theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
\]
Because \(\theta\) is estimated on the full dataset (not within groups), and because assignment is independent of \(X\), the adjusted estimator is unbiased. The variance reduction is proportional to \(\rho^2\) — the squared correlation between pre- and post-experiment metrics. For metrics like revenue-per-user, \(\rho\) is often 0.5–0.8, reducing required sample sizes by 25–64%.
CUPED is one of the highest-leverage practices in online experimentation. Booking.com, Netflix, and Microsoft use it as standard practice. The only requirement is a pre-experiment measurement of the outcome metric — usually available from historical logs.
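The adjustment itself is a few lines. A minimal sketch on synthetic data (metric names and numbers are illustrative), showing that the variance reduction tracks \(\rho^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Pre-period metric X and a correlated post-period metric Y (synthetic)
x = rng.normal(100, 20, n)              # pre-experiment revenue per user
y = 0.8 * x + rng.normal(0, 12, n)      # experiment-period revenue per user

# theta is estimated on the pooled data, ignoring group labels
theta = np.cov(x, y)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

rho = np.corrcoef(x, y)[0, 1]
var_reduction = 1 - np.var(y_cuped) / np.var(y)
print(f"rho = {rho:.2f}, variance reduced by {var_reduction:.1%} (~ rho^2 = {rho**2:.1%})")
```

The adjusted metric `y_cuped` is then compared between treatment and control exactly as the raw metric would be; only its variance has changed.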
6.4 Peeking and Early Stopping
The most common mistake in online experimentation is peeking: checking the p-value daily and stopping the experiment as soon as \(p < 0.05\). This inflates the Type I error rate dramatically — a nominal 5% threshold becomes 30% or higher under repeated checking.
The root cause is that p-values are calibrated for a single look at the data. Taking multiple looks without correction means we are effectively running multiple tests, each of which has a chance of producing a false positive.
Three approaches handle this correctly.
Pre-commit and wait. Calculate the required sample size, run the experiment until it is reached, and analyze once. Simple and statistically valid, but inflexible.
Alpha spending (O’Brien-Fleming, Pocock). Divide the alpha budget across planned interim looks, using a stricter threshold at early looks (when less evidence has accumulated) and a threshold close to 0.05 at the final look. Group-sequential boundaries are tabulated in standard references and implemented in packages such as R’s gsDesign.
Always-Valid Inference (mSPRT, sequential tests). These tests produce confidence sequences that remain valid at any stopping time. Open-source implementations are available in both R and Python, and several large tech companies now use sequential tests as their default.
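To make the alpha-spending idea concrete, here is a sketch of the approximate two-sided boundaries for five equally spaced looks at \(\alpha = 0.05\). The constants 2.040 (O’Brien-Fleming) and 2.413 (Pocock) are the standard tabulated values for this design; treat them as illustrative rather than something to recompute here:

```python
import math

K = 5                              # planned looks: 4 interim + 1 final
c_obf, c_pocock = 2.040, 2.413     # tabulated constants for K=5, alpha=0.05

for k in range(1, K + 1):
    t = k / K                      # information fraction at look k
    z_obf = c_obf / math.sqrt(t)   # O'Brien-Fleming: very strict early on
    print(f"look {k}: OBF z >= {z_obf:.2f}, Pocock z >= {c_pocock:.2f}")
```

The pattern is the point: O’Brien-Fleming demands overwhelming evidence at the first look (z above 4.5) and relaxes toward roughly 2.04 at the final look, while Pocock spends alpha evenly with the same threshold at every look.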
Code

```python
# Simulate peeking inflation vs. committed stopping
np.random.seed(7)
n_simulations = 2000
n_max = 400          # max per group
check_every = 20

false_positives_peek = 0
false_positives_fixed = 0

for _ in range(n_simulations):
    # Null is true: both groups drawn from same distribution
    a = np.random.normal(0, 1, n_max)
    b = np.random.normal(0, 1, n_max)

    # Peeking: stop early if p < 0.05 at any interim look
    peeked_sig = False
    for n in range(check_every, n_max + 1, check_every):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:
            peeked_sig = True
            break

    # Fixed: single look at n_max
    _, p_final = stats.ttest_ind(a, b)

    if peeked_sig:
        false_positives_peek += 1
    if p_final < 0.05:
        false_positives_fixed += 1

print(f"Type I error — fixed stopping: {false_positives_fixed/n_simulations:.1%} (target: 5%)")
print(f"Type I error — peeking:        {false_positives_peek/n_simulations:.1%} (target: 5%)")
print()
print("Peeking inflates the false positive rate by "
      f"{false_positives_peek/false_positives_fixed:.1f}x relative to the nominal level.")
```
6.5 Multi-Armed Bandits
Classical A/B testing answers a question: is Variant B better than Variant A? Bandits solve a different problem: given several variants, minimize the total cost of exploring while exploiting the best option.
In a standard A/B test, we split traffic 50/50 until the experiment ends, then route all traffic to the winner. If one variant is clearly inferior, we still send it half the traffic for the duration. A bandit adaptively shifts traffic toward the variant that appears to be performing better, reducing regret during the experiment.
Thompson sampling is the most popular bandit algorithm for binary outcomes (click / no click, convert / no convert). Each variant maintains a Beta distribution over its true conversion rate. At each step, we sample from each variant’s Beta, and serve whichever variant has the highest sample. As evidence accumulates, the distributions narrow and the better variant receives proportionally more traffic.
The tradeoff: bandits optimize during the experiment but produce less clean causal estimates than A/B tests. Use bandits when the experiment period is long and regret during testing is costly (content recommendation, ad serving). Use A/B tests when a clean causal estimate is required (pricing changes, product redesigns, policy decisions).
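The Beta-Bernoulli version of Thompson sampling described above fits in a short simulation. A minimal sketch with two arms (the conversion rates are illustrative), showing traffic shifting toward the truly better variant:

```python
import numpy as np

rng = np.random.default_rng(3)
true_rates = [0.05, 0.07]       # variant A, variant B (B is truly better)
alpha = np.ones(2)              # Beta(1, 1) priors: alpha = successes + 1
beta = np.ones(2)               #                    beta  = failures + 1
pulls = np.zeros(2, dtype=int)

for _ in range(20_000):
    # Sample a plausible conversion rate for each arm, serve the argmax
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print("Traffic share:       ", pulls / pulls.sum())
print("Posterior mean rates:", alpha / (alpha + beta))
```

Early on, both arms receive traffic because both posteriors are wide; as the posteriors narrow, the inferior arm is sampled highest less and less often, which is exactly the adaptive allocation the section describes.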
6.6 Multivariate Testing and Factorial Designs
Sometimes we want to test several changes simultaneously — a new headline, a different button color, and a revised call-to-action. A full factorial design tests every combination. With 3 binary factors, that is \(2^3 = 8\) cells.
The advantage is that factorial designs detect interaction effects — cases where the combination of changes produces an effect larger or smaller than their sum. Missing an interaction can lead to bad decisions: if a new headline works well only with the old button color, a sequential approach (test headline first, then button) will reach the wrong conclusion.
The disadvantage is sample size. Eight cells each require sufficient users to be powered, so the total sample is roughly 8x a standard A/B test. Fractional factorial designs test a carefully chosen subset of cells that still allows estimation of main effects, at the cost of confounding some interaction terms.
In practice, most teams use MVT sparingly — for landing pages, email templates, and ad creatives where many independent design decisions must be made quickly. For product features with complex logic, sequential A/B tests are more tractable.
Code

```python
# 2x2 factorial design: headline (A/B) x button color (red/green)
np.random.seed(20)
n_per_cell = 300

# True conversion rates for each cell
# Headline B adds +2%, green button adds +1%, but interaction adds +4% extra
cells = {
    ("A", "red"):   0.05,
    ("A", "green"): 0.06,
    ("B", "red"):   0.07,
    ("B", "green"): 0.12,   # synergy between B + green
}

rows = []
for (headline, button), rate in cells.items():
    conversions = np.random.binomial(1, rate, n_per_cell)
    for v in conversions:
        rows.append({"headline": headline, "button": button, "converted": v})
df = pd.DataFrame(rows)

print("Observed conversion rates by cell:")
pivot = df.groupby(["headline", "button"])["converted"].mean().unstack()
print(pivot.applymap(lambda x: f"{x:.2%}"))
print()
print("Main effect of headline:", f"{pivot.mean(axis=1).diff().iloc[1]:.2%}")
print("Main effect of button:  ", f"{pivot.mean(axis=0).diff().iloc[1]:.2%}")
print("Interaction (B+green vs additive):",
      f"{pivot.loc['B','green'] - pivot.loc['A','green'] - (pivot.loc['B','red'] - pivot.loc['A','red']):.2%}")
```
6.7 Network Effects and SUTVA
The Stable Unit Treatment Value Assumption (SUTVA) requires that one unit’s outcome is unaffected by another unit’s assignment. Many real-world settings violate this.
In a two-sided marketplace, showing a new pricing algorithm to some buyers affects sellers’ experience and vice versa. In a social network, showing a new feed algorithm to 50% of users affects what the other 50% see because content is shared between them. In ride-sharing, driver availability for one city segment depends on trips offered in adjacent segments.
Strategies for handling interference:
Geo-based randomization: randomize at the city or DMA level rather than the user level. Clean separation, but many fewer effective experimental units.
Ego-network clusters: in social networks, cluster users by their social graph so treatment group members are connected mostly to other treatment group members.
Holdout regions: keep entire geographic markets as holdouts rather than splitting within markets.
Switchback tests: alternate conditions over time blocks within the same market, then use time-series methods to estimate the effect.
Detecting interference: if the spillover is network-based, we can measure it by comparing outcomes for control users who have many treated neighbors vs. few treated neighbors. A difference suggests interference and warns that the naive estimate is biased.
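A sketch of this exposure-based check on synthetic data. The spillover is built in here (each treated neighbor nudges a control user's outcome by a hypothetical 0.15), so the comparison should flag it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 4000  # control-group users

# Synthetic spillover: outcome rises with the number of treated neighbors
treated_neighbors = rng.poisson(3, n)
outcome = rng.normal(10, 2, n) + 0.15 * treated_neighbors  # 0.15 = assumed spillover

lo = outcome[treated_neighbors <= 1]   # control users with few treated neighbors
hi = outcome[treated_neighbors >= 5]   # control users with many treated neighbors

t, p = stats.ttest_ind(hi, lo)
print(f"many-neighbor mean {hi.mean():.2f} vs few-neighbor mean {lo.mean():.2f}, p = {p:.1g}")
# A significant gap among *control* users signals interference:
# the naive treatment-vs-control estimate is likely biased.
```

In a real analysis the neighbor counts come from the social graph or marketplace structure, and the outcome is the experiment's primary metric; the logic is otherwise the same.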
6.8 Test Prioritization
Running experiments has a cost — engineering time to instrument, analytical time to evaluate, and opportunity cost of the traffic during the test. Not every experiment idea is worth running.
Two popular frameworks for prioritizing the backlog:
ICE: Impact (how much could this move the metric?), Confidence (how sure are we it will work?), Ease (how easy is it to implement?). Score each on a 1-10 scale and multiply. Quick, informal, useful for pruning obvious low-value tests.
RICE: Reach (how many users are affected per quarter?), Impact (rough scale: 0.25, 0.5, 1, 2, 3), Confidence (% expressed as a decimal), Effort (person-weeks). Score = (Reach × Impact × Confidence) / Effort. More rigorous and forces explicit business-case thinking.
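Scoring a backlog with the RICE formula above takes only a few lines. All project names and numbers here are made up for illustration:

```python
# Hypothetical experiment backlog scored with RICE
# Score = (Reach * Impact * Confidence) / Effort
ideas = [
    {"name": "new onboarding flow",   "reach": 8000,  "impact": 2.0,  "confidence": 0.8, "effort": 6},
    {"name": "checkout button copy",  "reach": 20000, "impact": 0.25, "confidence": 0.9, "effort": 1},
    {"name": "pricing page redesign", "reach": 5000,  "impact": 3.0,  "confidence": 0.5, "effort": 10},
]
for idea in ideas:
    idea["rice"] = idea["reach"] * idea["impact"] * idea["confidence"] / idea["effort"]

for idea in sorted(ideas, key=lambda d: d["rice"], reverse=True):
    print(f"{idea['name']:24s} RICE = {idea['rice']:8.0f}")
```

Note how the cheap, high-reach copy change outranks the ambitious redesign: dividing by effort is what forces the business-case discipline the framework is known for.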
Beyond scoring: a test is only worth running if the MDE is achievable given realistic traffic and timeline. Before committing, calculate the required sample size from the MDE, check whether available traffic supports it within a reasonable window, and confirm that the instrumentation to measure the metric is already in place. Many experiments fail not because the treatment did not work, but because the measurement was never valid to begin with.
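The feasibility check can be sketched with the standard two-sample z-approximation for sample size; the baseline rate and MDE below are illustrative:

```python
from scipy import stats

def required_n_per_group(mde, sigma, alpha=0.05, power=0.8):
    # Two-sided, two-sample z-approximation:
    # n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / mde^2
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2

# Example: detect a 0.5-point lift on a 5% baseline conversion rate
p0 = 0.05
mde = 0.005
sigma = (p0 * (1 - p0)) ** 0.5      # Bernoulli sd under the baseline rate
n = required_n_per_group(mde, sigma)
print(f"~{n:,.0f} users per group")
```

If available traffic cannot supply that many users per group within a reasonable window, the experiment as specified is not worth starting; either the MDE must be relaxed or the metric variance reduced (e.g., via CUPED).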
A useful heuristic: if you would not change anything based on a result in the range of your MDE, the MDE is set wrong. Only run experiments where every possible outcome is actionable.
6.9 Key Takeaways
Define the unit of randomization, primary metric, and MDE before running the experiment
Use stratified assignment to guarantee covariate balance; use cluster randomization when SUTVA is violated
Apply CUPED when a pre-experiment measurement of the outcome metric is available — it routinely doubles power at no cost
Never peek: commit to the sample size before starting, or use a sequential testing method
Use bandits when regret during testing is costly; use A/B tests when a clean causal estimate is needed
Factorial designs detect interactions that sequential tests miss, but require proportionally larger samples
Prioritize experiments using RICE or ICE; confirm that MDE is achievable before instrumentation begins
Recommended reading:
- Trustworthy Online Controlled Experiments — Kohavi, Tang, Xu (Cambridge, 2020)
- Booking.com blog: “A/B testing at scale”
- Netflix Tech Blog: “Improving Experimentation Efficiency at Netflix with Meta Analysis and Optimal Stopping”