Linear regression and correlation describe relationships between two quantitative variables. This cheat sheet helps students read scatterplots, measure association, write prediction equations, and judge whether a model is reasonable. It is useful for homework, tests, labs, and data projects where students must connect calculations to real context. The main regression model is the least-squares line \hat{y} = a + bx, where b is the slope and a is the intercept. Correlation r measures the strength and direction of a linear relationship, while r^2 describes the proportion of variation explained by the model. Residuals, written e = y - \hat{y}, show prediction error and help check whether a linear model fits the data well.

Key Facts

  • The least-squares regression line has the form \hat{y} = a + bx, where \hat{y} is the predicted response, a is the intercept, and b is the slope.
  • The slope of the regression line is b = r\,\frac{s_y}{s_x}, where r is the correlation and s_x and s_y are the sample standard deviations of x and y.
  • The intercept is a = \bar{y} - b\bar{x}, so the regression line always passes through the point (\bar{x}, \bar{y}).
  • The correlation coefficient is r = \frac{1}{n-1}\sum \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) and satisfies -1 \le r \le 1.
  • A residual is e_i = y_i - \hat{y}_i, and positive residuals mean the actual value is above the predicted value.
  • The least-squares line minimizes the sum of squared residuals, written \sum (y_i - \hat{y}_i)^2.
  • The coefficient of determination is r^2, which gives the proportion of variation in y explained by the linear relationship with x.
  • Use a regression line for interpolation within the data range, but avoid extrapolation far outside the observed x values.
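The formulas above can be sketched in Python. This is a minimal illustration with a made-up data set (the numbers are not from the cheat sheet), showing the order of the calculations: means, then sample standard deviations, then r, then the slope b = r\,\frac{s_y}{s_x} and intercept a = \bar{y} - b\bar{x}.

```python
import math

# Hypothetical data set (x, y pairs), used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample standard deviations (divide by n - 1).
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# Correlation: average product of z-scores, r = (1/(n-1)) * sum(z_x * z_y).
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

# Slope and intercept of the least-squares line y_hat = a + b x.
b = r * s_y / s_x
a = y_bar - b * x_bar

print(f"r = {r:.4f}, slope b = {b:.4f}, intercept a = {a:.4f}")
```

Note that the line passes through (\bar{x}, \bar{y}): plugging x_bar into a + b * x_bar returns y_bar exactly, which matches the intercept fact above.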

Vocabulary

Scatterplot
A graph of paired quantitative data values (x, y) used to show the form, direction, strength, and outliers in a relationship.
Correlation coefficient
The number r that measures the direction and strength of a linear association between two quantitative variables.
Least-squares regression line
The line \hat{y} = a + bx that minimizes the sum of squared residuals for a set of data.
Residual
A residual is the prediction error e = y - \hat{y} for one data point.
Coefficient of determination
The value r^2 is the proportion of variation in the response variable explained by the regression model.
Extrapolation
Extrapolation is using a regression model to predict values outside the range of the original data.

Common Mistakes to Avoid

  • Using correlation to prove causation is wrong because a strong value of r only shows linear association, not that one variable causes the other.
  • Forgetting the context of the slope is wrong because b means the predicted change in y for each increase of 1 unit in x.
  • Mixing up y and \hat{y} is wrong because y is an observed value, while \hat{y} is a predicted value from the regression line.
  • Ignoring residual plots is wrong because a curved residual pattern suggests that a linear model may not be appropriate.
  • Extrapolating far beyond the data is wrong because the linear pattern may not continue outside the observed range of x values.
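The distinction between y and \hat{y}, and the role of r^2, can be checked numerically. This is a minimal sketch with hypothetical data and an assumed fitted line (the values are invented for illustration): for a least-squares fit the residuals sum to approximately zero, and r^2 = 1 - SSE/SST gives the fraction of variation in y explained by the line.

```python
# Hypothetical data and an assumed least-squares line y_hat = 0.05 + 1.99x
# (values chosen for illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = 0.05, 1.99

# y is observed; y_hat is predicted; residual e = y - y_hat.
y_hat = [a + b * x for x in xs]
residuals = [y - yh for y, yh in zip(ys, y_hat)]

# For a least-squares fit, the residuals sum to (approximately) zero.
print("residuals:", [round(e, 2) for e in residuals])
print("sum of residuals:", sum(residuals))

# r^2 = 1 - SSE/SST: proportion of variation in y explained by the line.
y_bar = sum(ys) / len(ys)
sse = sum(e ** 2 for e in residuals)           # sum of squared residuals
sst = sum((y - y_bar) ** 2 for y in ys)        # total variation in y
r_squared = 1 - sse / sst
print("r^2 =", round(r_squared, 4))
```

A positive residual means the actual point sits above the line (the prediction was too low); a negative residual means the point sits below it.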

Practice Questions

  1. A regression line is \hat{y} = 12 + 3.5x. What is the predicted value of y when x = 8, and what does the slope mean in context?
  2. For one data point, y = 42 and \hat{y} = 38. Find the residual e = y - \hat{y} and explain whether the prediction was too high or too low.
  3. A data set has r = -0.82. Find r^2 and interpret what it says about the linear model.
  4. A scatterplot shows a strong curved pattern, but the correlation is close to 0. Explain why a low correlation does not always mean there is no relationship.