Experimental Design & A/B Testing Lab

Design controlled experiments with random assignment, simulate data with configurable effect sizes and noise, and analyze results using two-sample t-tests, confidence intervals, and power analysis. Explore how sample size, blocking, and confounding variables affect your ability to detect real effects.

Guided Experiment: Does Sample Size Matter?

If you run the same experiment with n=10, n=30, and n=100 subjects per group, how do you predict the p-values and ability to detect a true effect will change?

Write your hypothesis in the Lab Report panel, then click Next.

Random Assignment

Control (n=25) | Treatment (n=25)

Controls

Compare click-through rates for a red vs green call-to-action button

Total Sample Size (n): 50 (25 per group)
Effect Size (Cohen's d): 0.5 (Medium)
Noise Multiplier: 1.0×
Random Seed: 42
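The controls above can be sketched as a small simulator. This is a hypothetical illustration, not the lab's actual implementation: `simulate_experiment` and its parameter names are assumptions, chosen to mirror the panel's settings (n per group, Cohen's d, noise multiplier, seed). The control group is centered at 0 and the treatment group is shifted by d standard deviations.

```python
import random

def simulate_experiment(n_per_group=25, cohens_d=0.5, noise=1.0, seed=42):
    """One simulated run: control centered at 0, treatment shifted by
    cohens_d standard deviations; the noise multiplier scales the common SD."""
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    control = [rng.gauss(0.0, noise) for _ in range(n_per_group)]
    treatment = [rng.gauss(cohens_d * noise, noise) for _ in range(n_per_group)]
    return control, treatment

control, treatment = simulate_experiment()
```

Because the true effect is built in as a shift of d standard deviations, any analysis run on this data can be checked against a known ground truth.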

Results

Run the experiment to see statistical analysis results.

Data Table

(0 rows)
# | Trial | Scenario | n Control | n Treatment | Mean Control | Mean Treatment | Difference | p-value | Significant?
Reference Guide

Random Assignment

Random assignment ensures each subject has an equal chance of being placed in the control or treatment group. This eliminates systematic differences between groups, so any observed effect can be attributed to the treatment rather than confounding variables.

Without randomization, pre-existing differences between groups can bias results and produce spurious "significant" findings even when the treatment has no real effect.
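Random assignment can be sketched in a few lines of stdlib Python: shuffle the subject IDs, then split the shuffled list in half. This is an illustrative sketch (the IDs and seed are assumptions), but the key property holds: every subject is equally likely to land in either group.

```python
import random

rng = random.Random(42)        # fixed seed so the assignment is reproducible
subjects = list(range(50))     # 50 hypothetical subject IDs
rng.shuffle(subjects)          # uniformly random permutation
control, treatment = subjects[:25], subjects[25:]
```

Splitting a shuffled list gives balanced group sizes by construction; assigning each subject by an independent coin flip would randomize group sizes as well.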

Two-Sample t-Test (Welch's)

Welch's t-test compares the means of two independent groups without assuming equal variances.

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

The degrees of freedom are estimated using the Welch-Satterthwaite equation. A two-tailed p-value below 0.05 suggests the group means differ significantly.
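A minimal stdlib sketch of the statistic and the Welch-Satterthwaite degrees of freedom is below. The p-value itself requires a t-distribution CDF (e.g. `scipy.stats.t.sf(abs(t), df) * 2` for a two-tailed test), which is omitted to keep the example dependency-free.

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    m1, m2 = mean(x), mean(y)
    v1, v2 = variance(x), variance(y)   # sample variances (n - 1 denominator)
    n1, n2 = len(x), len(y)
    se2 = v1 / n1 + v2 / n2             # squared standard error of the difference
    t = (m1 - m2) / se2 ** 0.5
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```

Note that `df` is generally non-integer and never exceeds n1 + n2 - 2, the pooled-variance degrees of freedom.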

Cohen's d Effect Size

Cohen's d measures the standardized difference between two group means, independent of sample size.

d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}

Convention: |d| < 0.2 is negligible, 0.2-0.5 is small, 0.5-0.8 is medium, and > 0.8 is large. Unlike the p-value, effect size tells you how practically important the difference is.
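The formula above translates directly into stdlib Python, using the standard pooled variance (each group's sample variance weighted by its degrees of freedom):

```python
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d: standardized mean difference using the pooled SD."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5
```

Because d is in units of standard deviations, it stays comparable across experiments with different scales, whereas a raw mean difference does not.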

Statistical Power

Statistical power is the probability of correctly detecting a true effect (rejecting the null hypothesis when it is false). It depends on three factors: effect size, sample size, and significance level.

\text{Power} = P(\text{reject } H_0 \mid H_1 \text{ is true})

A power of 80% (0.80) is the conventional minimum. Lower power means you are likely to miss real effects (Type II error). Run a power analysis before the experiment to determine the required sample size.
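Power can also be estimated by Monte Carlo simulation, which is how this lab's "does sample size matter?" question can be answered directly: simulate many experiments with a known true effect and count how often the test rejects. The sketch below is an assumption-laden simplification: it uses the normal-approximation critical value (z ≈ 1.96 at α = 0.05) instead of a t critical value, which is slightly anti-conservative at small n.

```python
import random
from statistics import NormalDist, mean, variance

def power_sim(n_per_group, d, runs=2000, alpha=0.05, seed=42):
    """Monte Carlo power estimate: fraction of simulated experiments whose
    Welch t statistic exceeds the two-tailed normal critical value."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(runs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]  # control
        y = [rng.gauss(d, 1.0) for _ in range(n_per_group)]    # treatment
        se = (variance(x) / n_per_group + variance(y) / n_per_group) ** 0.5
        t = (mean(y) - mean(x)) / se
        hits += abs(t) > z_crit
    return hits / runs
```

With a medium effect (d = 0.5), the estimate rises from roughly 0.2 at n = 10 per group to well above 0.8 at n = 100 per group, matching the guided experiment's premise that larger samples make true effects far easier to detect.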