Experimental Design & A/B Testing Lab
Design controlled experiments with random assignment, simulate data with configurable effect sizes and noise, and analyze results using two-sample t-tests, confidence intervals, and power analysis. Explore how sample size, blocking, and confounding variables affect your ability to detect real effects.
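The lab's own simulator is built in, but the idea behind it can be sketched in a few lines. The following is a minimal illustration (the function name `simulate_groups` and its parameters are hypothetical, not the lab's actual API): the control group is drawn from a normal distribution, and the treatment group is shifted by the effect size, measured in standard deviations of the noise.

```python
import random

def simulate_groups(n, effect_size, noise_sd=1.0, seed=None):
    """Simulate one experiment: control ~ N(0, sd), treatment ~ N(d*sd, sd)."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    treatment = [rng.gauss(effect_size * noise_sd, noise_sd) for _ in range(n)]
    return control, treatment
```

With a large n, the observed mean difference lands close to the configured effect size; with a small n, it bounces around, which is exactly what the guided experiment above asks you to predict.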
Guided Experiment: Does Sample Size Matter?
If you run the same experiment with n=10, n=30, and n=100 subjects per group, how do you expect the p-values and your ability to detect a true effect to change?
Write your hypothesis in the Lab Report panel, then click Next.
Random Assignment
Controls
Compare click-through rates for a red vs green call-to-action button
Results
Run the experiment to see statistical analysis results.
Data Table
| # | Trial | Scenario | n Control | n Treatment | Mean Control | Mean Treatment | Difference | p-value | Significant? |
|---|---|---|---|---|---|---|---|---|---|

(0 rows)
Reference Guide
Random Assignment
Random assignment ensures each subject has an equal chance of being placed in the control or treatment group. This eliminates systematic differences between groups, so any observed effect can be attributed to the treatment rather than confounding variables.
Without randomization, pre-existing differences between groups can bias results and produce spurious "significant" findings even when the treatment has no real effect.
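A minimal sketch of random assignment (the helper `random_assign` is illustrative, not the lab's internal code): shuffle the subject list, then split it in half, so group membership is independent of any subject characteristic.

```python
import random

def random_assign(subjects, seed=None):
    """Randomly split subjects into equal-sized control and treatment groups."""
    rng = random.Random(seed)
    shuffled = subjects[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # (control, treatment)
```

Because every permutation is equally likely, any pre-existing difference between subjects is spread evenly across the two groups on average.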
Two-Sample t-Test (Welch's)
Welch's t-test compares the means of two independent groups without assuming equal variances.
The degrees of freedom are estimated using the Welch-Satterthwaite equation. A two-tailed p-value below 0.05 suggests the group means differ significantly.
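The statistic and the Welch-Satterthwaite degrees of freedom can be computed directly from the two samples. Below is a minimal pure-Python sketch (the p-value lookup is omitted, since it requires a t-distribution CDF; in practice a library such as `scipy.stats.ttest_ind(a, b, equal_var=False)` handles that step):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2a, se2b = va / na, vb / nb                   # squared standard errors
    t = (ma - mb) / math.sqrt(se2a + se2b)
    df = (se2a + se2b) ** 2 / (se2a ** 2 / (na - 1) + se2b ** 2 / (nb - 1))
    return t, df
```

Note that unlike the pooled-variance t-test, the degrees of freedom here are generally not `na + nb - 2` and need not be an integer.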
Cohen's d Effect Size
Cohen's d measures the standardized difference between two group means, independent of sample size.
Convention: |d| < 0.2 is negligible, 0.2-0.5 is small, 0.5-0.8 is medium, and |d| > 0.8 is large. Unlike the p-value, the effect size tells you how practically important the difference is.
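A compact sketch of Cohen's d using the pooled standard deviation (one common variant; the helper name `cohens_d` is illustrative):

```python
import math

def cohens_d(a, b):
    """Cohen's d: standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd
```

Because the denominator is a standard deviation rather than a standard error, d does not shrink or grow with sample size the way a t statistic does.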
Statistical Power
Statistical power is the probability of correctly detecting a true effect (rejecting the null hypothesis when it is false). It depends on three factors: effect size, sample size, and significance level.
A power of 80% (0.80) is the conventional minimum. Lower power means you are likely to miss real effects (Type II error). Run a power analysis before the experiment to determine the required sample size.
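The three-way dependence can be made concrete with the standard normal-approximation formulas for a two-sample test. The sketch below assumes a two-sided alpha of 0.05 and uses z critical values rather than the exact t distribution, so it slightly underestimates the required n for small samples:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

Z_ALPHA = 1.959964  # two-sided alpha = 0.05
Z_BETA = 0.841621   # target power = 0.80

def approx_power(d, n_per_group, z_alpha=Z_ALPHA):
    """Approximate power to detect effect size d with n subjects per group."""
    return normal_cdf(abs(d) * math.sqrt(n_per_group / 2) - z_alpha)

def required_n(d, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Approximate per-group sample size for the target power."""
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)
```

For a medium effect (d = 0.5) this gives roughly 63 subjects per group for 80% power, while n = 10 per group leaves power well below 50%, so most real medium-sized effects would be missed.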