Linear regression and correlation describe relationships between two quantitative variables. This cheat sheet helps students read scatterplots, measure association, write prediction equations, and judge whether a model is reasonable. It is useful for homework, tests, labs, and data projects where students must connect calculations to real context. The main regression model is the least-squares line $\hat{y}=a+bx$, where $b$ is the slope and $a$ is the intercept. Correlation $r$ measures the strength and direction of a linear relationship, while $r^2$ describes the proportion of variation explained by the model. Residuals, written $e=y-\hat{y}$, show prediction error and help check whether a linear model fits the data well.

Key Facts

  • The least-squares regression line has the form $\hat{y}=a+bx$, where $\hat{y}$ is the predicted response, $a$ is the intercept, and $b$ is the slope.
  • The slope of the regression line is $b=r\frac{s_y}{s_x}$, where $r$ is the correlation and $s_x$ and $s_y$ are the sample standard deviations.
  • The intercept is $a=\bar{y}-b\bar{x}$, so the regression line always passes through the point $(\bar{x},\bar{y})$.
  • The correlation coefficient is $r=\frac{1}{n-1}\sum \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$ and satisfies $-1\le r\le 1$.
  • A residual is $e_i=y_i-\hat{y}_i$, and positive residuals mean the actual value is above the predicted value.
  • The least-squares line minimizes the sum of squared residuals, written $\sum (y_i-\hat{y}_i)^2$.
  • The coefficient of determination is $r^2$, which gives the proportion of variation in $y$ explained by the linear relationship with $x$.
  • Use a regression line for interpolation within the data range, but avoid extrapolation far outside the observed $x$ values.
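
The formulas above can be checked by hand on a small data set. The following is a minimal sketch in Python that uses only the standard library; the function name `least_squares_line` and the five data points are made up for illustration and are not part of the original sheet.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b, r) for y_hat = a + b*x using the Key Facts formulas."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    # r = 1/(n-1) * sum of products of standardized x and y values
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b, r

# Hypothetical data, chosen only so the arithmetic is easy to follow
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = least_squares_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.3f}")  # a = 2.200, b = 0.600, r = 0.775

# The line passes through (x_bar, y_bar): the prediction at x_bar equals y_bar
print(abs((a + b * mean(xs)) - mean(ys)) < 1e-9)  # True
```

For this data, $\bar{x}=3$ and $\bar{y}=4$, so the fitted line $\hat{y}=2.2+0.6x$ gives $2.2+0.6(3)=4$ at $x=\bar{x}$, as the intercept formula requires.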

Vocabulary

Scatterplot
A graph of paired quantitative data values $(x,y)$ used to show the form, direction, strength, and outliers in a relationship.
Correlation coefficient
The number $r$ that measures the direction and strength of a linear association between two quantitative variables.
Least-squares regression line
The line $\hat{y}=a+bx$ that minimizes the sum of squared residuals for a set of data.
Residual
A residual is the prediction error $e=y-\hat{y}$ for one data point.
Coefficient of determination
The value $r^2$ is the proportion of variation in the response variable explained by the regression model.
Extrapolation
Extrapolation is using a regression model to predict values outside the range of the original data.
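
To connect the residual and coefficient-of-determination definitions, here is a short Python sketch. It assumes a made-up data set and takes $a=2.2$, $b=0.6$ as its least-squares fit; the helper names `residuals` and `r_squared` are illustrative only. It relies on the standard identity that, for a least-squares line, $r^2 = 1 - \frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}$.

```python
def residuals(xs, ys, a, b):
    """Residual e = y - y_hat for each data point, with y_hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def r_squared(xs, ys, a, b):
    """Proportion of variation in y explained by the line."""
    y_bar = sum(ys) / len(ys)
    sse = sum(e ** 2 for e in residuals(xs, ys, a, b))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in ys)             # total variation in y
    return 1 - sse / sst

# Hypothetical data and its least-squares fit (assumed values for illustration)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

res = residuals(xs, ys, a, b)
print([round(e, 2) for e in res])          # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(round(r_squared(xs, ys, a, b), 3))   # 0.6
```

The first point has a negative residual, so the line predicted too high there; the third has a positive residual, so the line predicted too low. About 60% of the variation in $y$ is explained by the line for this data.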

Common Mistakes to Avoid

  • Using correlation to prove causation is wrong because a strong value of $r$ only shows linear association, not that one variable causes the other.
  • Forgetting the context of the slope is wrong because $b$ means the predicted change in $y$ for each increase of $1$ unit in $x$.
  • Mixing up $y$ and $\hat{y}$ is wrong because $y$ is an observed value, while $\hat{y}$ is a predicted value from the regression line.
  • Ignoring residual plots is wrong because a curved residual pattern suggests that a linear model may not be appropriate.
  • Extrapolating far beyond the data is wrong because the linear pattern may not continue outside the observed range of $x$ values.
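
The residual-plot and extrapolation warnings can be illustrated together. The sketch below fits a line to made-up data that follows $y=x^2$ exactly; the helper `fit_line` is hypothetical and computes the slope as $\sum (x_i-\bar{x})(y_i-\bar{y}) / \sum (x_i-\bar{x})^2$, which is algebraically the same as $b=r\frac{s_y}{s_x}$.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical curved data: y = x^2 observed only for x from 1 to 5
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
res = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b)   # -7.0 6.0
print(res)    # [2.0, -1.0, -2.0, -1.0, 2.0]

# Extrapolating to x = 10, far outside the observed range 1..5
predicted = a + b * 10
actual = 10 ** 2
print(predicted, actual)  # 53.0 100
```

The residuals go positive, negative, positive across the $x$ range. That U-shaped pattern is the curved signature a residual plot would reveal, even though the correlation for this data is about $0.98$. At $x=10$ the line predicts $53$ while the underlying pattern gives $100$, which shows how quickly a linear fit can fail outside the data.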

Practice Questions

  1. A regression line is $\hat{y}=12+3.5x$. What is the predicted value of $y$ when $x=8$, and what does the slope mean in context?
  2. For one data point, $y=42$ and $\hat{y}=38$. Find the residual $e=y-\hat{y}$ and explain whether the prediction was too high or too low.
  3. A data set has $r=-0.82$. Find $r^2$ and interpret what it says about the linear model.
  4. A scatterplot shows a strong curved pattern, but the correlation is close to $0$. Explain why a low correlation does not always mean there is no relationship.