Regression & Residual Diagnostics Lab

Fit regression models, examine residual plots, identify outliers and influential points, and discover why R² alone is never enough. Includes Anscombe's Quartet to see how identical statistics can hide completely different data patterns.

Guided Experiment: Why R² Isn't Enough

If four datasets have the same R² and the same regression line equation, do you think the regression model is equally valid for all four?

Write your hypothesis in the Lab Report panel, then click Next.

Scatter Plot with Regression Line

[Scatter plot of x vs. y with the fitted regression line; axes span roughly x ∈ [−0.1, 13.1] and y ∈ [−0.1, 26.1]]

Residual Plot

Click Run to see the residual plot

Controls

 #    x    y
 1    1    2.1
 2    2    4.3
 3    3    5.8
 4    4    8.2
 5    5    9.9
 6    6   12.1
 7    7   14.3
 8    8   15.8
 9    9   18
10   10   20.2
11   11   22.1
12   12   23.9
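As a sanity check outside the lab UI, the table above can be fit with ordinary least squares in a few lines. A minimal sketch using NumPy (variable names are illustrative):

```python
import numpy as np

x = np.arange(1, 13, dtype=float)
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1,
              14.3, 15.8, 18.0, 20.2, 22.1, 23.9])

# Least-squares fit: y ≈ intercept + slope·x
slope, intercept = np.polyfit(x, y, 1)

# R² = 1 − SS_res / SS_tot
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R²={r_squared:.4f}")
```

For this dataset the fit comes out close to ŷ = 2x with R² near 1, so the residual plot should show only random scatter.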

Diagnostics

Click Run to fit the model and see diagnostics.

Data Table

# | Dataset | Model | Adj R² | Residual Pattern | Outliers | Influential Points | Recommendation

Reference Guide

Residual Plots

A residual is the difference between the observed value and the predicted value.

e_i = y_i - \hat{y}_i

Random scatter around zero means the model fits well. Curved patterns suggest the model is missing a nonlinear term. Funnel shapes indicate heteroscedasticity (non-constant variance).
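One quick numeric check for a curved pattern: correlate the residuals with a candidate missing term. A minimal sketch, with made-up quadratic data for illustration:

```python
import numpy as np

# Made-up data with a hidden quadratic term
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# Fit a straight line anyway and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# A good linear fit leaves random scatter; here the residuals form a
# U-shape (positive at the ends, negative in the middle) and track
# the centered quadratic term almost perfectly.
r = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]
print(f"corr(residuals, centered x²) = {r:.3f}")
```

A correlation near 1 with the squared term is strong evidence that the model is missing a nonlinear component.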

Leverage & Influential Points

Leverage measures how far a point's x-value is from the mean.

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}

Cook's distance combines leverage and residual size to find points that strongly influence the regression line.

D_i = \frac{e_i^2 \, h_i}{p \cdot \mathrm{MSE} \cdot (1 - h_i)^2}
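Both formulas are easy to evaluate directly. A sketch with a made-up dataset containing one deliberately influential point (the last one, far from the others in x and off the trend in y):

```python
import numpy as np

# Ten well-behaved points plus one influential outlier (made up)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20], dtype=float)
noise = np.array([0.1, -0.2, 0.15, 0.0, -0.1, 0.2,
                  -0.15, 0.1, 0.0, -0.05, -15.0])
y = 2 * x + noise

n, p = len(x), 2                    # p counts both fitted parameters
slope, intercept = np.polyfit(x, y, 1)
e = y - (intercept + slope * x)     # residuals

# Leverage: h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)²
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Cook's distance: D_i = e_i² h_i / (p · MSE · (1 − h_i)²)
mse = np.sum(e ** 2) / (n - p)
D = e ** 2 * h / (p * mse * (1 - h) ** 2)

print("most influential point index:", int(np.argmax(D)))
```

The last point has both high leverage (its x is far from x̄) and a large residual, so its Cook's distance dwarfs the others.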

Transformations for Linearity

When residuals show a curved pattern, applying a transformation can linearize the relationship.

  • log(y) works for exponential growth data
  • log(x) works for power-law relationships
  • √y and √x moderate right-skewed distributions
  • 1/x handles reciprocal relationships

After transforming, check whether the residual plot improves to random scatter.
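For example, the log(y) transformation turns exponential growth into an exact straight line. A minimal sketch with made-up exponential data:

```python
import numpy as np

# Made-up exponential growth data: y = 2·e^(0.3x)
x = np.linspace(0, 10, 30)
y = 2.0 * np.exp(0.3 * x)

# A straight line fits log(y) exactly: log(y) = log(2) + 0.3x
slope, intercept = np.polyfit(x, np.log(y), 1)
print(f"slope ≈ {slope:.3f} (true 0.3), "
      f"intercept ≈ {intercept:.3f} (true log 2 ≈ 0.693)")
```

A linear fit to the raw y would leave curved residuals; after the transform, the residuals collapse to (numerical) zero.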

Model Comparison (R² vs Adjusted R²)

R² always increases when you add more terms to a model, even if they add no real predictive power.

R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p}

Adjusted R² penalizes for extra parameters (p). AIC and BIC provide further model comparison, with BIC applying a stronger penalty for model complexity.
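The penalty is easy to see numerically: adding a near-useless term nudges R² up but still lowers adjusted R². A minimal sketch using the formula above (the R² values are made up for illustration):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R², with p counting all fitted parameters
    (intercept included), per the formula above."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Adding one nearly useless term raises R² slightly (0.900 → 0.905)
# but raises p too, so adjusted R² goes down.
print(adjusted_r2(0.900, n=20, p=2))   # simpler model
print(adjusted_r2(0.905, n=20, p=3))   # one extra term
```

This is the behavior the lab's Adj R² column reports: more terms only help when the gain in fit outweighs the parameter penalty.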

Anscombe's Quartet demonstrates why you should never trust R² alone. All four datasets share nearly identical summary statistics (means, variances, correlation, R², and fitted line) yet have completely different structures.
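The first two quartet datasets make the point concisely: dataset I is linear with noise, dataset II is a smooth curve, yet both yield essentially the same fitted line and R². A sketch using values transcribed from Anscombe (1973):

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
# Dataset I: roughly linear with noise
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])
# Dataset II: a smooth parabola, not linear at all
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
               6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    slope, intercept = np.polyfit(x, y, 1)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    print(f"dataset {name}: ŷ = {intercept:.2f} + {slope:.3f}x, R² = {r2:.3f}")
```

Both fits come out near ŷ = 3.00 + 0.500x with R² ≈ 0.67; only the residual plots reveal that one model is reasonable and the other is badly misspecified.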