Experimental Design & A/B Testing Lab
Design controlled experiments with random assignment, simulate data with configurable effect sizes and noise, and analyze results using two-sample t-tests, confidence intervals, and power analysis. Explore how sample size, blocking, and confounding variables affect your ability to detect real effects.
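The lab's own simulator is built in, but the idea behind it can be sketched in a few lines. The following is a minimal illustration (the function name `simulate_groups` and its parameters are hypothetical, not the lab's actual API): the control group is drawn from a normal distribution, and the treatment group is shifted by the effect size, measured in standard deviations of the noise.

```python
import random

def simulate_groups(n, effect_size, noise_sd=1.0, seed=None):
    """Simulate one experiment: control ~ N(0, sd), treatment ~ N(d*sd, sd)."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    treatment = [rng.gauss(effect_size * noise_sd, noise_sd) for _ in range(n)]
    return control, treatment
```

With a large n, the observed mean difference lands close to the configured effect size; with a small n, it bounces around, which is exactly what the guided experiment above asks you to predict.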
Guided Experiment: Does Sample Size Matter?
If you run the same experiment with n=10, n=30, and n=100 subjects per group, how do you expect the p-values and your ability to detect a true effect to change?
Write your hypothesis in the Lab Report panel, then click Next.
Random Assignment
Controls
Compare click-through rates for a red vs green call-to-action button
Results
Run the experiment to see statistical analysis results.
Data Table
| # | Trial | Scenario | n Control | n Treatment | Mean Control | Mean Treatment | Difference | p-value | Significant? |
|---|---|---|---|---|---|---|---|---|---|

(0 rows)
Reference Guide
Random Assignment
Random assignment ensures each subject has an equal chance of being placed in the control or treatment group. This eliminates systematic differences between groups, so any observed effect can be attributed to the treatment rather than confounding variables.
Without randomization, pre-existing differences between groups can bias results and produce spurious "significant" findings even when the treatment has no real effect.
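A minimal sketch of random assignment (the helper `random_assign` is illustrative, not the lab's internal code): shuffle the subject list, then split it in half, so group membership is independent of any subject characteristic.

```python
import random

def random_assign(subjects, seed=None):
    """Randomly split subjects into equal-sized control and treatment groups."""
    rng = random.Random(seed)
    shuffled = subjects[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # (control, treatment)
```

Because every permutation is equally likely, any pre-existing difference between subjects is spread evenly across the two groups on average.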
Two-Sample t-Test (Welch's)
Welch's t-test compares the means of two independent groups without assuming equal variances.
The degrees of freedom are estimated using the Welch-Satterthwaite equation. A two-tailed p-value below 0.05 suggests the group means differ significantly.
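The statistic and the Welch-Satterthwaite degrees of freedom can be computed directly from the two samples. Below is a minimal pure-Python sketch (the p-value lookup is omitted, since it requires a t-distribution CDF; in practice a library such as `scipy.stats.ttest_ind(a, b, equal_var=False)` handles that step):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2a, se2b = va / na, vb / nb                   # squared standard errors
    t = (ma - mb) / math.sqrt(se2a + se2b)
    df = (se2a + se2b) ** 2 / (se2a ** 2 / (na - 1) + se2b ** 2 / (nb - 1))
    return t, df
```

Note that unlike the pooled-variance t-test, the degrees of freedom here are generally not `na + nb - 2` and need not be an integer.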
Cohen's d Effect Size
Cohen's d measures the standardized difference between two group means, independent of sample size.
Convention: |d| < 0.2 is negligible, 0.2-0.5 is small, 0.5-0.8 is medium, and |d| > 0.8 is large. Unlike the p-value, the effect size tells you how practically important the difference is.
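A compact sketch of Cohen's d using the pooled standard deviation (one common variant; the helper name `cohens_d` is illustrative):

```python
import math

def cohens_d(a, b):
    """Cohen's d: standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd
```

Because the denominator is a standard deviation rather than a standard error, d does not shrink or grow with sample size the way a t statistic does.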
Statistical Power
Statistical power is the probability of correctly detecting a true effect (rejecting the null hypothesis when it is false). It depends on three factors: effect size, sample size, and significance level.
A power of 80% (0.80) is the conventional minimum. Lower power means you are likely to miss real effects (Type II error). Run a power analysis before the experiment to determine the required sample size.
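The three-way dependence can be made concrete with the standard normal-approximation formulas for a two-sample test. The sketch below assumes a two-sided alpha of 0.05 and uses z critical values rather than the exact t distribution, so it slightly underestimates the required n for small samples:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

Z_ALPHA = 1.959964  # two-sided alpha = 0.05
Z_BETA = 0.841621   # target power = 0.80

def approx_power(d, n_per_group, z_alpha=Z_ALPHA):
    """Approximate power to detect effect size d with n subjects per group."""
    return normal_cdf(abs(d) * math.sqrt(n_per_group / 2) - z_alpha)

def required_n(d, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Approximate per-group sample size for the target power."""
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)
```

For a medium effect (d = 0.5) this gives roughly 63 subjects per group for 80% power, while n = 10 per group leaves power well below 50%, so most real medium-sized effects would be missed.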