Regression & Residual Diagnostics Lab

Fit regression models, examine residual plots, identify outliers and influential points, and discover why R² alone is never enough. Includes Anscombe's Quartet to see how identical statistics can hide completely different data patterns.

Guided Experiment: Why R² Isn't Enough

If four datasets have the same R² and the same regression line equation, do you think the regression model is equally valid for all four?

Write your hypothesis in the Lab Report panel, then click Next.

Scatter Plot with Regression Line

[Scatter plot of x vs. y with the fitted regression line; axes span roughly x ∈ [−0.1, 13.1] and y ∈ [−0.1, 26.1]]

Residual Plot

Click Run to see the residual plot

Controls

 #    x    y
 1    1    2.1
 2    2    4.3
 3    3    5.8
 4    4    8.2
 5    5    9.9
 6    6   12.1
 7    7   14.3
 8    8   15.8
 9    9   18
10   10   20.2
11   11   22.1
12   12   23.9
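As a sanity check outside the lab UI, the table above can be fit with ordinary least squares in a few lines. A minimal sketch using NumPy (variable names are illustrative):

```python
import numpy as np

x = np.arange(1, 13, dtype=float)
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1,
              14.3, 15.8, 18.0, 20.2, 22.1, 23.9])

# Least-squares fit: y ≈ intercept + slope·x
slope, intercept = np.polyfit(x, y, 1)

# R² = 1 − SS_res / SS_tot
y_hat = intercept + slope * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R²={r_squared:.4f}")
```

For this dataset the fit comes out close to ŷ = 2x with R² near 1, so the residual plot should show only random scatter.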

Diagnostics

Click Run to fit the model and see diagnostics.

Data Table

# | Dataset | Model | Adj R² | Residual Pattern | Outliers | Influential Points | Recommendation

Reference Guide

Residual Plots

A residual is the difference between the observed value and the predicted value.

e_i = y_i - \hat{y}_i

Random scatter around zero means the model fits well. Curved patterns suggest the model is missing a nonlinear term. Funnel shapes indicate heteroscedasticity (non-constant variance).
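One quick numeric check for a curved pattern: correlate the residuals with a candidate missing term. A minimal sketch, with made-up quadratic data for illustration:

```python
import numpy as np

# Made-up data with a hidden quadratic term
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# Fit a straight line anyway and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# A good linear fit leaves random scatter; here the residuals form a
# U-shape (positive at the ends, negative in the middle) and track
# the centered quadratic term almost perfectly.
r = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]
print(f"corr(residuals, centered x²) = {r:.3f}")
```

A correlation near 1 with the squared term is strong evidence that the model is missing a nonlinear component.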

Leverage & Influential Points

Leverage measures how far a point's x-value is from the mean.

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}

Cook's distance combines leverage and residual size to find points that strongly influence the regression line.

D_i = \frac{e_i^2 \, h_i}{p \cdot \mathrm{MSE} \cdot (1 - h_i)^2}
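Both formulas are easy to evaluate directly. A sketch with a made-up dataset containing one deliberately influential point (the last one, far from the others in x and off the trend in y):

```python
import numpy as np

# Ten well-behaved points plus one influential outlier (made up)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20], dtype=float)
noise = np.array([0.1, -0.2, 0.15, 0.0, -0.1, 0.2,
                  -0.15, 0.1, 0.0, -0.05, -15.0])
y = 2 * x + noise

n, p = len(x), 2                    # p counts both fitted parameters
slope, intercept = np.polyfit(x, y, 1)
e = y - (intercept + slope * x)     # residuals

# Leverage: h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)²
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Cook's distance: D_i = e_i² h_i / (p · MSE · (1 − h_i)²)
mse = np.sum(e ** 2) / (n - p)
D = e ** 2 * h / (p * mse * (1 - h) ** 2)

print("most influential point index:", int(np.argmax(D)))
```

The last point has both high leverage (its x is far from x̄) and a large residual, so its Cook's distance dwarfs the others.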

Transformations for Linearity

When residuals show a curved pattern, applying a transformation can linearize the relationship.

  • log(y) works for exponential growth data
  • log(x) works for power-law relationships
  • √y and √x moderate right-skewed distributions
  • 1/x handles reciprocal relationships

After transforming, check whether the residual plot improves to random scatter.
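For example, the log(y) transformation turns exponential growth into an exact straight line. A minimal sketch with made-up exponential data:

```python
import numpy as np

# Made-up exponential growth data: y = 2·e^(0.3x)
x = np.linspace(0, 10, 30)
y = 2.0 * np.exp(0.3 * x)

# A straight line fits log(y) exactly: log(y) = log(2) + 0.3x
slope, intercept = np.polyfit(x, np.log(y), 1)
print(f"slope ≈ {slope:.3f} (true 0.3), "
      f"intercept ≈ {intercept:.3f} (true log 2 ≈ 0.693)")
```

A linear fit to the raw y would leave curved residuals; after the transform, the residuals collapse to (numerical) zero.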

Model Comparison (R² vs Adjusted R²)

R² always increases when you add more terms to a model, even if they add no real predictive power.

R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p}

Adjusted R² penalizes for extra parameters (p). AIC and BIC provide further model comparison, with BIC applying a stronger penalty for model complexity.
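The penalty is easy to see numerically: adding a near-useless term nudges R² up but still lowers adjusted R². A minimal sketch using the formula above (the R² values are made up for illustration):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R², with p counting all fitted parameters
    (intercept included), per the formula above."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Adding one nearly useless term raises R² slightly (0.900 → 0.905)
# but raises p too, so adjusted R² goes down.
print(adjusted_r2(0.900, n=20, p=2))   # simpler model
print(adjusted_r2(0.905, n=20, p=3))   # one extra term
```

This is the behavior the lab's Adj R² column reports: more terms only help when the gain in fit outweighs the parameter penalty.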

Anscombe's Quartet demonstrates why you should never trust R² alone. All four datasets share nearly identical summary statistics (means, variances, correlation, R², and fitted line) yet have completely different structures.
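The first two quartet datasets make the point concisely: dataset I is linear with noise, dataset II is a smooth curve, yet both yield essentially the same fitted line and R². A sketch using values transcribed from Anscombe (1973):

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
# Dataset I: roughly linear with noise
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])
# Dataset II: a smooth parabola, not linear at all
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
               6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    slope, intercept = np.polyfit(x, y, 1)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    print(f"dataset {name}: ŷ = {intercept:.2f} + {slope:.3f}x, R² = {r2:.3f}")
```

Both fits come out near ŷ = 3.00 + 0.500x with R² ≈ 0.67; only the residual plots reveal that one model is reasonable and the other is badly misspecified.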